Change Kubernetes Persistent Volume's Storage Class and keep the data

I have Elasticsearch data pods currently running on AKS and connected to Persistent Volumes that use a Premium SSD Managed Disk Storage Class, and I want to downgrade them to Standard SSD Managed Disk without losing the data on the currently used Persistent Volumes.
I've created a new Storage Class defined with Standard SSD Managed Disk, but if I create a new PV from it, it obviously doesn't contain the old data and I'd need to copy it somehow, so I was wondering what the best practice is for switching a PV's Storage Class.

Unfortunately, once a PVC is created and a PV is provisioned for it, the only thing you can change without creating a new one is the volume's size.
The only straightforward way I could think of without leveraging CSI snapshots/clones, which you might not have access to (depends on how you created PVCs/PVs AFAIK), would be to create a new PVC and mount both volumes on a Deployment whose Pod has root access and the rsync command available.
Running rsync -a /old/volume/mount/path/ /new/volume/mount/path on such a Pod should get you what you want (the trailing slash on the source copies its contents rather than the directory itself).
However, you should make sure that you do so BEFORE deleting PVCs or any other resource using your PVs. By default, most default storage classes create volumes with reclaim policies that immediately delete the PV as soon as all resources using it are gone, so there is a small risk of data loss.
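As a rough sketch of that approach (the Pod name, PVC names and mount paths here are assumptions for illustration, not taken from the question), such a Pod could look like this:
apiVersion: v1
kind: Pod
metadata:
  name: volume-migrator              # hypothetical name
spec:
  containers:
    - name: migrator
      image: ubuntu                  # any image that has, or can install, rsync
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: old-volume
          mountPath: /old/volume/mount/path
        - name: new-volume
          mountPath: /new/volume/mount/path
  volumes:
    - name: old-volume
      persistentVolumeClaim:
        claimName: old-premium-pvc   # assumed name of the existing PVC
    - name: new-volume
      persistentVolumeClaim:
        claimName: new-standard-pvc  # assumed name of the new PVC
You would then exec into it and run the rsync command above.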

It is not possible in Kubernetes to change the storage class of a PVC, for the reason described above. Unfortunately, you have to create a new PVC with the desired storage class.

One way to go about this would be to create a snapshot and then a new volume from that snapshot: https://kubernetes.io/docs/concepts/storage/volume-snapshots/
Another is to leverage CSI cloning: https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/
Both effectively facilitate creating a fresh PV with a copy of the existing data.
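As a rough sketch of the snapshot route (this assumes your cluster has the snapshot CRDs and a CSI driver with snapshot support, and that a VolumeSnapshotClass named csi-snapclass exists; all names here are placeholders):
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: old-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: old-premium-pvc    # the existing PVC
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: new-standard-pvc
spec:
  storageClassName: new-standard-ssd              # the new, cheaper storage class
  dataSource:
    name: old-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 128Gi                              # at least as large as the source volume
A clone looks similar, except dataSource points directly at the existing PVC (kind: PersistentVolumeClaim, no apiGroup) instead of at a snapshot.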

Extending LeoD's answer, I'm sharing the complete step-by-step guide of what I eventually did.
Scale Stateful Set to 0.
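For example (the Stateful Set name is a placeholder; the namespace matches the one used below):
kubectl -n my-namespace scale statefulset my-stateful-set --replicas=0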
Find the old PVs' Disk resources in Azure:
Run kubectl describe pv to check which PVs are associated with the PVCs I want to replace.
Follow the resource ID that is specified in field DiskURI to get to the Disks in Azure.
Create an Azure Snapshot resource from each PV's Disk.
From each Snapshot, create a new Disk with at least the same allocated GiB as the current PVC. The SKU still doesn't matter at this point, but I preferred using the new SKU I want to implement.
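With the Azure CLI, those two steps might look roughly like this (resource group, SKU and names are placeholders; the disk resource ID comes from the DiskURI found above):
az snapshot create \
  --resource-group my-rg \
  --name snapshot-from-pvc-0 \
  --source "/subscriptions/<sub-id>/resourceGroups/<node-rg>/providers/Microsoft.Compute/disks/<old-disk-name>"
az disk create \
  --resource-group my-rg \
  --name new-disk-from-pvc-0 \
  --source snapshot-from-pvc-0 \
  --sku StandardSSD_LRS \
  --size-gb 1064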
Create a K8S Persistent Volume from each new Disk:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-from-azure-disk-0
  namespace: my-namespace
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
spec:
  capacity:
    storage: 1064Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed
  claimRef:
    name: pvc-from-azure-0
    namespace: my-namespace
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<Your subscription ID>/resourcegroups/<Name of RG where the new Disk is stored at>/providers/microsoft.compute/disks/new-disk-from-pvc-0
    fsType: ext4
Make sure to edit the following fields:
metadata.namespace: To the namespace where you have your old resources.
spec.capacity.storage: To the allocated GiB you chose for the new Disk.
spec.storageClassName: Name of Storage Class that suits the SKU you chose for the Disk.
spec.claimRef.name: Here we specify a name for a PVC we haven't created yet; just make sure the name you choose here is the same name you're going to use when creating the PVC in the next step.
spec.claimRef.namespace: To the namespace where you have your old resources.
csi.volumeHandle: Insert Resource ID of your new Disk.
Create a K8S Persistent Volume Claim from each new Persistent Volume you've just created:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-from-azure-0
  namespace: my-namespace
spec:
  storageClassName: managed
  volumeName: pv-from-azure-disk-0
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1064Gi
Make sure to edit the following fields:
metadata.name: Use the exact same name you used in the PV's spec.claimRef.name field.
metadata.namespace: To the namespace where you have your old resources.
spec.storageClassName: To the name of Storage Class that you specified for the PV in the spec.storageClassName field.
spec.volumeName: To the name of the PV you've created in the former step.
spec.resources.requests.storage: To the allocated GiB you specified in the PV's spec.capacity.storage field.
Perform steps 2-6 for each existing PVC you want to replace; if your Stateful Set has 3 Pods, for example, you should do this 3 times, once for each Pod's PVC.
Create an Ubuntu Pod for each new PVC and mount the PVC to the Pod (if you have 3 PVCs, for example, you should have 3 Ubuntu Pods, one for each).
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-0
  namespace: my-namespace
spec:
  containers:
    - name: ubuntu-0
      image: ubuntu
      command:
        - "sleep"
        - "604800"
      volumeMounts:
        - mountPath: /mnt/snapshot-data-0
          name: snapshot-data-0
  volumes:
    - name: snapshot-data-0
      persistentVolumeClaim:
        claimName: pvc-from-azure-0
Make sure to edit the following fields:
metadata.namespace: To the namespace where you have your old resources.
spec.volumes.persistentVolumeClaim.claimName: To the name of the new PVC you've created.
Exec into each Ubuntu Pod and validate that all your data is in there.
If all the data is there, you can delete the Ubuntu Pod.
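For example (Pod name, namespace and mount path as used above):
kubectl -n my-namespace exec -it ubuntu-0 -- ls -lah /mnt/snapshot-data-0
kubectl -n my-namespace delete pod ubuntu-0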
Now comes the crucial part: you need to delete the old original PVCs and the Stateful Set. Before you do that, make sure you have the original YAML file of the Stateful Set (or whatever else you used to create it) and that you know how to re-create it; only then continue and delete the old original PVCs and Stateful Set.
After the Stateful Set and old PVCs have been deleted, replace the value of storageClassName within the volumeClaimTemplates block of your Stateful Set YAML or configuration with the new Storage Class you want to use:
...
volumeClaimTemplates:
  - metadata:
      ...
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "managed" # <--- Edit this field
      resources:
        requests:
          storage: "1064Gi"
        limits:
          storage: "1064Gi"
...
Now recreate the Stateful Set with the new configuration. This will recreate the Stateful Set and create new empty PVCs which are going to use the new specified SKU from the Storage Class.
After creation of the PVCs has been completed, scale the new Stateful Set to 0.
Now recreate the Ubuntu Pod but this time mount both the PVC from the Azure Disk you created in the first steps and the new empty PVC of the new Stateful Set:
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-0
  namespace: my-namespace
spec:
  containers:
    - name: ubuntu-0
      image: ubuntu
      command:
        - "sleep"
        - "604800"
      volumeMounts:
        - mountPath: /mnt/snapshot-data-0
          name: snapshot-data-0
        - mountPath: /mnt/new-data-0
          name: new-data-0
  volumes:
    - name: snapshot-data-0
      persistentVolumeClaim:
        claimName: pvc-from-azure-0
    - name: new-data-0
      persistentVolumeClaim:
        claimName: my-stateful-set-data-0
Make sure to edit the following fields:
metadata.namespace: To the namespace where you have your resources.
spec.volumes.persistentVolumeClaim: There are two fields of this, one for snapshot-data-0, and one for new-data-0. As their names suggest, the snapshot-data-0 should have the old data from the Azure Disk we created in the first steps, and the new-data-0 should be empty since it should be the new PVC that was created from the recreation of the Stateful Set.
So, the value for spec.volumes.persistentVolumeClaim of snapshot-data-0 should be the name of the new PVC you've created from the Azure Disk. And the value for spec.volumes.persistentVolumeClaim of new-data-0 should be the name of the new PVC that was created following the recreation of the Stateful Set.
Exec into the Ubuntu Pod and run the following commands:
apt update && apt install rsync -y
nohup rsync -avz /mnt/snapshot-data-0/ /mnt/new-data-0 --log-file=$HOME/.rsyncd.log &
The first command installs rsync; the second copies all the old data into the new empty PVC and keeps a log at /root/.rsyncd.log so you can check progress. Note that depending on the amount of data you have stored, this command could take some time; for me personally it took about 1 hour to complete.
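To follow the progress from outside the Pod, something like this should work (Pod name and namespace as above):
kubectl -n my-namespace exec -it ubuntu-0 -- tail -f /root/.rsyncd.log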
Perform steps 15-16 for each pair of backed-up and new PVCs.
Once it finishes for all PVC pairs, delete the Ubuntu Pods.
Scale the new Stateful Set back up.
Profit! You now have the same Stateful Set having the same data but with a different Storage Class.

Related

What happens when we create stateful set with many replicas with one pvc in kubernetes?

I'm new to Kubernetes and this topic is confusing for me. I've learned that a StatefulSet doesn't share the PV and each replica has its own PV. On the other hand, I've seen examples where one PVC was used in a StatefulSet with many replicas. So my question is: what happens then? Since PVC to PV is a 1:1 binding, one PVC can only bind to one PV, but each replica should have its own PV, so how is it possible to have one PVC in a StatefulSet in this scenario?
You should usually use a volume claim template with a StatefulSet. As you note in the question, this will create a new PersistentVolumeClaim (and a new PersistentVolume) for each replica. Data is not shared, except to the extent the container process knows how to replicate data between its replicas. If a StatefulSet Pod is deleted and recreated, it will come back with the same underlying PVC and the same data, even if it is recreated on a different Node.
spec:
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 1Gi
  template:
    spec:
      containers:
        - name: name
          volumeMounts:
            - name: data
              mountPath: /data
You're allowed to manually create a PVC and attach it to the StatefulSet Pods
# not recommended -- one PVC shared across all replicas
spec:
  template:
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: manually-created-pvc
      containers:
        - name: name
          volumeMounts:
            - name: data
              mountPath: /data
but in this case the single PVC/PV will be shared across all of the replicas. This often doesn't work well: things like database containers have explicit checks that their storage isn't shared, and there is a range of concurrency problems that are possible doing this. This also can prevent pods from starting up since the volume types that are straightforward to get generally only support a ReadWriteOnce access mode; to get ReadWriteMany you need to additionally configure something like an NFS server outside the cluster.
I am not sure which example you were following, but yes, PV to PVC is a 1:1 mapping.
Usually a PVC gets attached to a Pod with the ReadWriteOnce access mode, which means the volume can be mounted read-write by a single node only.
The scenario you might have seen, where a single PVC and a single PV are attached to multiple replicas, is possible with ReadWriteMany.
A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see AccessModes).
Read more about access modes here: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
NFS, EFS and other types of storage support the ReadWriteMany access mode.
When I deploy e.g. nginx as a StatefulSet and use one PVC, only one PV is created and storage is shared between all replicas.
Your experiment is correct; this is possible because the scheduler has assigned all of the pods to the same node due to the dependency on the PV. If the node runs out of resources and a pod has to be scheduled on another node, that pod will enter the Pending state.
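You can verify that all replicas landed on the same node with something like this (the label selector is an assumption based on the example above):
kubectl get pods -l app=nginx -o wide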

Reuse PV in Deployment

What I need?
A Deployment with 2 Pods which read from the SAME volume (PV). The volume must be shared between Pods in RW mode.
Note: I already have a Rook Ceph cluster with a defined storageClass "rook-cephfs" which allows this capability. This SC also has a Retain policy.
This is what I did:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-nginx
spec:
  accessModes:
    - "ReadWriteMany"
  resources:
    requests:
      storage: "10Gi"
  storageClassName: "rook-cephfs"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      serviceAccountName: default
      containers:
        - name: nginx
          image: nginx:latest
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 80
          volumeMounts:
            - name: pvc-data
              mountPath: /data
      volumes:
        - name: pvc-data
          persistentVolumeClaim:
            claimName: data-nginx
It works! Both nginx containers share the volume.
Problem:
If I delete all the resources (except the PV) and recreate them, a NEW PV is created instead of reusing the old one. So basically, the new volume is empty.
The OLD PV gets the status "Released" instead of "Available".
I realized that if I apply a patch to the PV to remove the claimRef.uid:
kubectl patch pv $PV_NAME --type json -p '[{"op": "remove", "path": "/spec/claimRef/uid"}]'
and then redeploy, it works.
But I don't want to do this manual step. I need this automated.
I also tried the same configuration with a StatefulSet and got the same problem.
Any solution?
Make sure to use reclaimPolicy: Retain in your StorageClass. It tells Kubernetes to keep the PV (and the underlying storage) around after the claim is deleted, instead of deleting it.
Ref: https://kubernetes.io/docs/tasks/administer-cluster/change-pv-reclaim-policy/
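If the PV was already provisioned with the Delete policy, the linked page shows how to switch an existing PV to Retain, roughly:
kubectl patch pv <your-pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'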
But I don't want to do this manual step. I need this automated.
Based on the official documentation, it is unfortunately impossible. First look at the Reclaim Policy:
PersistentVolumes that are dynamically created by a StorageClass will have the reclaim policy specified in the reclaimPolicy field of the class, which can be either Delete or Retain. If no reclaimPolicy is specified when a StorageClass object is created, it will default to Delete.
So, we have 2 supported options for Reclaim Policy: Delete or Retain.
Delete option is not for you, because,
for volume plugins that support the Delete reclaim policy, deletion removes both the PersistentVolume object from Kubernetes, as well as the associated storage asset in the external infrastructure, such as an AWS EBS, GCE PD, Azure Disk, or Cinder volume. Volumes that were dynamically provisioned inherit the reclaim policy of their StorageClass, which defaults to Delete. The administrator should configure the StorageClass according to users' expectations; otherwise, the PV must be edited or patched after it is created.
Retain option allows you for manual reclamation of the resource:
When the PersistentVolumeClaim is deleted, the PersistentVolume still exists and the volume is considered "released". But it is not yet available for another claim because the previous claimant's data remains on the volume. An administrator can manually reclaim the volume with the following steps.
Delete the PersistentVolume. The associated storage asset in external infrastructure (such as an AWS EBS, GCE PD, Azure Disk, or Cinder volume) still exists after the PV is deleted.
Manually clean up the data on the associated storage asset accordingly.
Manually delete the associated storage asset, or if you want to reuse the same storage asset, create a new PersistentVolume with the storage asset definition.
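In practice, if you want the same released PV to become bindable again without recreating it, clearing the whole claimRef (rather than only claimRef.uid as in the question) is the usual manual step; for example:
kubectl patch pv $PV_NAME --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
This moves the PV from Released back to Available, but note that it is still a manual, administrative action.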

How do I mount and format new google compute disk to be mounted in a GKE pod?

I have created a new disk in Google Compute Engine.
gcloud compute disks create --size=10GB --zone=us-central1-a dane-disk
It says I need to format it, but I have no idea how I could mount/access the disk.
gcloud compute disks list
NAME LOCATION LOCATION_SCOPE SIZE_GB TYPE STATUS
notowania-disk us-central1-a zone 10 pd-standard READY
New disks are unformatted. You must format and mount a disk before it
can be used. You can find instructions on how to do this at:
https://cloud.google.com/compute/docs/disks/add-persistent-disk#formatting
I tried the instructions above but lsblk is not showing the disk at all.
Do I need to create a VM and somehow attach the disk to it in order to use it? My goal was to mount the disk as a persistent GKE volume independent of the VM (last time a GKE upgrade caused recreation of the VM and data loss).
Thanks for the clarification of what you are trying to do in the comments.
I have 2 different answers here.
The first is that my testing shows that the Kubernetes GCE PD documentation is exactly right, and the warning about formatting seems like it can be safely ignored.
If you just issue:
gcloud compute disks create --size=10GB --zone=us-central1-a my-test-data-disk
And then use it in a pod:
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
    - image: nginx
      name: nginx-container
      volumeMounts:
        - mountPath: /test-pd
          name: test-volume
  volumes:
    - name: test-volume
      # This GCE PD must already exist.
      gcePersistentDisk:
        pdName: my-test-data-disk
        fsType: ext4
It will be formatted when it is mounted. This is likely because the fsType parameter instructs the system how to format the disk. You don't need to do anything with a separate GCE instance. The disk is retained even if you delete the pod or even the entire cluster. It is not reformatted on subsequent uses and the data is kept around.
So, the warning message from gcloud is confusing, but can be safely ignored in this case.
Now, in order to dynamically create a persistent volume based on GCE PD that isn't automatically deleted, you will need to create a new StorageClass that sets the Reclaim Policy to Retain, and then create a PersistentVolumeClaim based on that StorageClass. This also keeps basically the entire operation inside of Kubernetes, without needing to do anything with gcloud. Likewise, a similar approach is what you would want to use with a StatefulSet as opposed to a single pod, as described here.
Most of what you are looking to do is described in this GKE documentation about dynamically allocating PVCs as well as the Kubernetes StorageClass documentation. Here's an example:
gce-pd-retain-storageclass.yaml:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gce-pd-retained
reclaimPolicy: Retain
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replication-type: none
The above storage class is basically the same as the 'standard' GKE storage class, except with the reclaimPolicy set to Retain.
pvc-demo.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo-disk
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gce-pd-retained
  resources:
    requests:
      storage: 10Gi
Applying the above will dynamically create a disk that will be retained when you delete the claim.
And finally a demo-pod.yaml that mounts the PVC as a volume (this is really a nonsense example using nginx, but it demonstrates the syntax):
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
    - image: nginx
      name: nginx-container
      volumeMounts:
        - mountPath: /test-pd
          name: test-volume
  volumes:
    - name: test-volume
      persistentVolumeClaim:
        claimName: pvc-demo-disk
Now, if you apply these three in order, you'll get a container running using the PersistentVolumeClaim which has automatically created (and formatted) a disk for you. When you delete the pod, the claim keeps the disk around. If you delete the claim the StorageClass still keeps the disk from being deleted.
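For reference, applying them in order (file names as given above) would look something like:
kubectl apply -f gce-pd-retain-storageclass.yaml
kubectl apply -f pvc-demo.yaml
kubectl apply -f demo-pod.yaml
kubectl get pv,pvc   # the dynamically provisioned PV and the bound claim should show up here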
Note that the PV that is left around after this won't be automatically reused, as the data is still on the disk. See the Kubernetes documentation about what you can do to reclaim it in this case. Really, this mostly says that you shouldn't delete the PVC unless you're ready to do work to move the data off the old volume.
Note that these disks will even continue to exist when the entire GKE cluster is deleted as well (and you will continue to be billed for them until you delete them).

Kubernetes multiple pvc with statefulset for each pod vs single pvc for all pods?

I have deployed a Kubernetes cluster with stateful Pods for MySQL. For each Pod I have a different PVC.
For example: if there are 3 Pods, then there are 3 × 5 GB EBS PVCs.
So which way is better: using one PVC for all Pods, or using a different PVC for each Pod?
A StatefulSet must use volumeClaimTemplates if you want to have dedicated storage for each pod of a set. Based on that template, a PersistentVolumeClaim is created for each pod and the volume is configured to be bound to that claim. The generated PersistentVolumeClaim names consist of the volumeClaimTemplate name + pod name + ordinal number.
So if you add a volumeClaimTemplates part to your StatefulSet YAML (and delete specific persistentVolumeClaim references), something like this:
volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      resources:
        requests:
          storage: 10Gi
      accessModes:
        - ReadWriteOnce
Then go and create your StatefulSet, and when you afterwards examine one of its pods (kubectl get pod pod-name-0 -o yaml) you'll see something like this (volumes part of the output):
volumes:
  - name: mysql-data
    persistentVolumeClaim:
      claimName: mysql-data-pod-name-0   # dynamically created claim based on the template
So by using volumeClaimTemplates you don't need to create separate PVCs yourself and then reference each PVC in the StatefulSet to be mounted in your container at a specific mountPath (remember that each pod of a set must reference a different PVC, 1 PVC : 1 PV):
Part of the "containers" definition of your StatefulSet YAML:
volumeMounts:
  - name: mysql-data   # references the volumeClaimTemplate by its name (not the generated PVC name)
    mountPath: /var/lib/mysql
So aiming for each pod of the set to have dedicated storage without using volumeClaimTemplates leads to a lot of problems and overcomplicates managing and scaling the set.
A PVC gets bound to a specific PV. For a StatefulSet, in most imaginable cases, you want a PV that can be accessed only by a certain pod, so that data is not corrupted by a write attempt from a parallel process/pod (RWO rather than RWX mode).
With that in mind, you need a PVC per replica in a StatefulSet. Creating PVCs for replicas would get problematic very quickly if done manually, which is why the right way to do it is to use volumeClaimTemplates that will dynamically create PVCs for you as you scale your set.
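Putting the two fragments above together, a minimal sketch of a MySQL StatefulSet with per-pod storage might look like this (the image, storage size, password handling and the headless Service name are assumptions, not from the question):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql                      # assumed headless Service
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          env:
            - name: MYSQL_ROOT_PASSWORD   # required by the mysql image; use a Secret in practice
              value: "changeme"
          volumeMounts:
            - name: mysql-data            # matches the claim template name below
              mountPath: /var/lib/mysql
  volumeClaimTemplates:
    - metadata:
        name: mysql-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
This creates PVCs named mysql-data-mysql-0, mysql-data-mysql-1 and mysql-data-mysql-2, one per replica.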

Kubernetes trouble with StatefulSet and 3 PersistentVolumes

I'm in the process of creating a StatefulSet based on this yaml, that will have 3 replicas. I want each of the 3 pods to connect to a different PersistentVolume.
For the persistent volume I'm using 3 objects that look like this, with only the name changed (pvvolume, pvvolume2, pvvolume3):
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pvvolume
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/nfs"
  claimRef:
    kind: PersistentVolumeClaim
    namespace: default
    name: mongo-persistent-storage-mongo-0
The first of the 3 pods in the StatefulSet seems to be created without issue.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container.
Yet if I go to the tab showing PersistentVolumeClaims the second one that was created seems to have been successful.
If it was successful why does the pod think it failed?
I want each of the 3 pods to connect to a different PersistentVolume.
For that to work properly you will either need:
a provisioner (in the link you posted there are examples of how to set up a provisioner on aws, azure, googlecloud and minikube), or
a volume capable of being mounted multiple times (such as an nfs volume). Note however that in such a case all your pods read/write to the same folder, and this can lead to issues when they are not meant to lock/write to the same data concurrently. The usual use case for this is an upload folder that pods save to and that is later used for reading only, and similar use cases. SQL databases (such as mysql), on the other hand, are not meant to write to such a shared folder.
Instead of either of the mentioned requirements, in your volume manifest you are using hostPath (pointing to /nfs) and set it to ReadWriteOnce (only one can use it). You are also using 'standard' as the storage class, while in the URL you gave there are fast and slow ones, so you probably created your storage class as well.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container
That is because the first pod already took its claim (ReadWriteOnce, hostPath) and the second pod can't reuse the same one if a proper provisioner or access mode is not set up.
If it was successful why does the pod think it failed?
All PVCs were successfully bound to their accompanying PVs. But you never bind the second and third PVC to the second or third pod. You are retrying with the first claim on the second pod, and the first claim is already bound (to the first pod) in ReadWriteOnce mode, so it can't be bound to the second pod as well, and you are getting the error...
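If you want to stay with statically created PVs instead of the approach suggested below, a hedged sketch of what would make your original manifests line up: point each PV's claimRef at the claim the StatefulSet generates for its own replica (the PVC names follow the <template-name>-<pod-name> convention; the distinct hostPath per volume is an assumption). For the second volume, for example:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pvvolume2
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/nfs2"                              # assumed: a separate path per volume
  claimRef:
    kind: PersistentVolumeClaim
    namespace: default
    name: mongo-persistent-storage-mongo-1     # ordinal 1 for the second replica
pvvolume3 would then reference mongo-persistent-storage-mongo-2 in the same way.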
Suggested approach
Since you reference /nfs as your host path, it may be safe to assume that you are using some kind of NFS-backed file system, so here is one alternative setup that lets you mount dynamically provisioned persistent volumes over NFS to as many pods in a stateful set as you want.
Notes:
This only answers the original question of mounting persistent volumes across StatefulSet replicated pods, with the assumption of NFS sharing.
NFS is not really advisable for dynamic data such as a database. The usual use case is an upload folder or a moderate logging/backup folder. A database (SQL or NoSQL) is usually a no-no for NFS.
For mission/time-critical applications you might want to time/stress-test carefully prior to taking this approach in production, since both k8s and the external PV add some layers/latency in between. Although for some applications this might suffice, be warned about it.
You have limited control over the names of PVs that are dynamically created (k8s adds a suffix to newly created ones, and reuses available old ones if told to do so), but k8s will keep them after a pod gets terminated and assign the first available one to a new pod, so you won't lose state/data. This is something you can control with policies, though.
Steps:
For this to work you will first need to install the NFS provisioner from here:
https://github.com/kubernetes-incubator/external-storage/tree/master/nfs. Mind you, the installation is not complicated, but it has some steps where you have to take a careful approach (permissions, setting up NFS shares, etc.), so it is not just a fire-and-forget deployment. Take your time installing the NFS provisioner correctly. Once this is properly set up you can continue with the suggested manifests below:
Storage class manifest:
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: sc-nfs-persistent-volume
# if you changed this during provisioner installation, update also here
provisioner: example.com/nfs
Stateful Set (important excerpt only):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ss-my-app
spec:
  replicas: 3
  ...
  selector:
    matchLabels:
      app: my-app
      tier: my-mongo-db
  ...
  template:
    metadata:
      labels:
        app: my-app
        tier: my-mongo-db
    spec:
      ...
      containers:
        - image: ...
          ...
          volumeMounts:
            - name: persistent-storage-mount
              mountPath: /wherever/on/container/you/want/it/mounted
          ...
      ...
  volumeClaimTemplates:
    - metadata:
        name: persistent-storage-mount
      spec:
        storageClassName: sc-nfs-persistent-volume
        accessModes: [ ReadWriteOnce ]
        resources:
          requests:
            storage: 10Gi
  ...