Moving Pod to another node automatically - kubernetes

Is it possible for a pod/deployment/statefulset to be moved to another node, or recreated on another node automatically, if the first node fails? The pod in question is set to 1 replica. So is it possible to configure some sort of failover for Kubernetes pods? I've tried out pod affinity settings, but nothing is moved automatically; it has been around 10 minutes.
The YAML for the pod in question is below:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-rbd-sc-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: ceph-rbd-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: ceph-rbd-pod-pvc-sc
  labels:
    app: ceph-rbd-pod-pvc-sc
spec:
  containers:
    - name: ceph-rbd-pod-pvc-sc
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
  nodeSelector:
    etiket: worker
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: ceph-rbd-sc-pvc
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              name: ceph-rbd-pod-pvc-sc
          topologyKey: "kubernetes.io/hostname"
Edit:
I managed to get it to work. But now I have another problem: the newly created pod on the other node is stuck in "ContainerCreating" and the old pod is stuck in "Terminating". I also get a Multi-Attach error for the volume, stating that the PV is still in use by the old pod. The situation is the same for any deployment/statefulset with a PV attached; the problem is resolved only when the failed node comes back online. Is there a solution for this?

The answer from coderanger remains valid regarding Pods. Answering your last edit:
Your issue is with CSI.
When your Pod uses a PersistentVolume whose accessModes is RWO, and the Node hosting your Pod becomes unreachable, prompting the Kubernetes scheduler to terminate the current Pod and create a new one on another Node, your PersistentVolume can not be attached to the new Node.
The reason for this is that CSI introduced a kind of "lease", marking a volume as bound.
With previous CSI specs and implementations, this lock was not visible in terms of the Kubernetes API. If your ceph-csi deployment is recent enough, you should find a corresponding VolumeAttachment object that can be deleted to fix your issue:
# kubectl get volumeattachments -n ci
NAME                                                                   ATTACHER           PV                                         NODE                ATTACHED   AGE
csi-194d3cfefe24d5f22616fabd3d2fb2ce5f79b16bdca75088476c2902e7751794   rbd.csi.ceph.com   pvc-902c3925-11e2-4f7f-aac0-59b1edc5acf4   melpomene.xxx.com   true       14d
csi-24847171efa99218448afac58918b6e0bb7b111d4d4497166ff2c4e37f18f047   rbd.csi.ceph.com   pvc-b37722f7-0176-412f-b6dc-08900e4b210d   clio.xxx.com        true       90d
....
kubectl delete -n ci volumeattachment csi-xxxyyyzzz
Those VolumeAttachments are created by your CSI provisioner before the device mapper attaches a volume.
They are deleted only once the corresponding PV has been released from a given Node, according to its device mapper - which needs to be running, with kubelet up and the Node marked as Ready according to the API. Until then, other Nodes can't map it. There's no timeout: should a Node become unreachable due to network issues or an abrupt shutdown/force-off/reset, its RWO PVs are stuck.
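If you are not sure which VolumeAttachment corresponds to the stuck claim, a small sketch like this can narrow it down (assuming the PVC name ceph-rbd-sc-pvc from the question, in the current namespace):
PV=$(kubectl get pvc ceph-rbd-sc-pvc -o jsonpath='{.spec.volumeName}')
kubectl get volumeattachments | grep "$PV"
# then delete the matching attachment, as above:
# kubectl delete volumeattachment <name-from-the-matching-line>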
See: https://github.com/ceph/ceph-csi/issues/740
One workaround for this would be not to use CSI, and rather stick with legacy StorageClasses, in your case installing rbd on your nodes.
Though last I checked -- k8s 1.19.x -- I couldn't manage to get it working; I can't recall what was wrong. CSI tends to be "the way" to do it nowadays. Despite not being suitable for production use, sadly, unless you run in an IaaS with auto-scale groups deleting Nodes from the Kubernetes API (eventually evicting the corresponding VolumeAttachments), or use some kind of MachineHealthCheck like OpenShift 4 implements.

A bare Pod is a single immutable object. It doesn't have any of these nice things. Related: never ever use bare Pods for anything. If you try this with a Deployment, you should see it spawn a new Pod to get back to the requested number of replicas. If the new Pod is unschedulable, you should see events emitted explaining why: for example, only node 1 matches the nodeSelector you specified, or another Pod is already running on the other node, which triggers the anti-affinity.
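For illustration, a minimal sketch of the same workload as a Deployment (name, image and claim borrowed from the question; the nodeSelector and anti-affinity are omitted for brevity), so the controller recreates the Pod elsewhere when its node fails:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ceph-rbd-pod-pvc-sc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ceph-rbd-pod-pvc-sc
  template:
    metadata:
      labels:
        app: ceph-rbd-pod-pvc-sc
    spec:
      containers:
        - name: ceph-rbd-pod-pvc-sc
          image: busybox
          command: ["sleep", "infinity"]
          volumeMounts:
            - mountPath: /mnt/ceph_rbd
              name: volume
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: ceph-rbd-sc-pvc
Note that even with a Deployment, the default tolerations for node.kubernetes.io/not-ready and unreachable (300 seconds) mean the replacement Pod is normally only created about 5 minutes after the node goes NotReady.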

Related

What happens when we create stateful set with many replicas with one pvc in kubernetes?

I'm new to Kubernetes and this topic is confusing for me. I've learned that a StatefulSet doesn't share the PV and each replica has its own PV. On the other hand, I saw examples where one PVC was used in a StatefulSet with many replicas. So my question is: what happens then? Since PVC to PV binding is 1:1, one PVC can only bind to one PV, but each replica should have its own PV - so how is it possible to use one PVC in a StatefulSet in this scenario?
You should usually use a volume claim template with a StatefulSet. As you note in the question, this will create a new PersistentVolumeClaim (and a new PersistentVolume) for each replica. Data is not shared, except to the extent the container process knows how to replicate data between its replicas. If a StatefulSet Pod is deleted and recreated, it will come back with the same underlying PVC and the same data, even if it is recreated on a different Node.
spec:
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 1Gi
  template:
    spec:
      containers:
        - name: name
          volumeMounts:
            - name: data
              mountPath: /data
You're allowed to manually create a PVC and attach it to the StatefulSet Pods
# not recommended -- one PVC shared across all replicas
spec:
  template:
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: manually-created-pvc
      containers:
        - name: name
          volumeMounts:
            - name: data
              mountPath: /data
but in this case the single PVC/PV will be shared across all of the replicas. This often doesn't work well: things like database containers have explicit checks that their storage isn't shared, and there is a range of concurrency problems that are possible doing this. This also can prevent pods from starting up since the volume types that are straightforward to get generally only support a ReadWriteOnce access mode; to get ReadWriteMany you need to additionally configure something like an NFS server outside the cluster.
I am not sure which example you were following or tested for that scenario, but yes, PV to PVC is a 1:1 mapping.
Usually a PVC gets attached to a Pod with access mode ReadWriteOnce, which means only one node can mount it read-write.
The scenario you might have seen could be a single PVC and a single PV attached to multiple replicas, which is possible with ReadWriteMany.
A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see AccessModes).
Read more about access mode here : https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
NFS, EFS and other types of storage support the ReadWriteMany access mode.
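For illustration, a claim requesting shared access could look roughly like this (a sketch; the storage class name nfs-client is an assumption and has to match whatever RWX-capable provisioner is actually installed):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany              # needs a backend that supports RWX (NFS, EFS, CephFS, ...)
  storageClassName: nfs-client   # assumption: an NFS-style provisioner exists under this name
  resources:
    requests:
      storage: 1Gi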
When I deploy e.g. nginx as a StatefulSet and use one PVC, only one PV is created and storage is shared between all replicas.
Your experiment is correct; this is possible because the scheduler has assigned all of the pods to the same node due to the dependency on the PV. If the node runs out of resources and a pod gets scheduled on another node, that pod will enter the Pending state.

Adding a volume to a Kubernetes StatefulSet using kubectl patch

Problem summary:
I am following the Kubernetes guide to set up a sample Cassandra cluster. The cluster is up and running, and I would like to add a second volume to each node in order to enable backups for Cassandra that would be stored on a separate volume.
My attempt at a solution:
I tried editing my cassandra-statefulset.yaml file by adding new volumeMounts and volumeClaimTemplates entries and reapplying it, but got the following error message:
$ kubectl apply -f cassandra-statefulset.yaml
storageclass.storage.k8s.io/fast unchanged
The StatefulSet "cassandra" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
I then tried to enable rolling updates and patch my configuration following the documentation here:
https://kubernetes.io/docs/tasks/run-application/update-api-object-kubectl-patch/
$ kubectl patch statefulset cassandra -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
statefulset.apps/cassandra patched (no change)
My cassandra-backup-patch.yaml:
spec:
  template:
    spec:
      containers:
        volumeMounts:
          - name: cassandra-backup
            mountPath: /cassandra_backup
  volumeClaimTemplates:
    - metadata:
        name: cassandra-backup
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: fast
        resources:
          requests:
            storage: 1Gi
However this resulted in the following error:
$ kubectl patch statefulset cassandra --patch "$(cat cassandra-backup-patch.yaml)"
The request is invalid: patch: Invalid value: "map[spec:map[template:map[spec:map[containers:map[volumeMounts:[map[mountPath:/cassandra_backup name:cassandra-backup]]]]] volumeClaimTemplates:[map[metadata:map[name:cassandra-backup] spec:map[accessModes:[ReadWriteOnce] resources:map[requests:map[storage:1Gi]] storageClassName:fast]]]]]": cannot restore slice from map
Could anyone please point me to the correct way of adding an additional volume for each node or explain why the patch does not work? This is my first time using Kubernetes so my approach may be completely wrong. Any comment or help is very welcome, thanks in advance.
The answer is in your first log:
The StatefulSet "cassandra" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy'
You can't change some fields in a statefulset after creation. You will likely need to delete and recreate the statefulset to add a new volumeClaimTemplate.
edit:
It can often be useful to leave your pods running even when you delete the StatefulSet. To accomplish this, use the --cascade=false flag on the delete operation.
kubectl delete statefulset <name> --cascade=false
Then your workload will stay running while you recreate your StatefulSet with the updated volumeClaimTemplates.
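On newer kubectl versions the flag is spelled --cascade=orphan; a sketch of the whole flow for the Cassandra example might look like this (the adopted pods only get the new volume once they are recreated, e.g. via a rolling restart):
kubectl delete statefulset cassandra --cascade=orphan    # pods keep running, only the controller object goes away
kubectl apply -f cassandra-statefulset.yaml              # recreate it with the extra volumeClaimTemplate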
As mentioned by switchboard.op, deleting is the answer.
but
Watch out for deleting these objects:
PersistentVolumeClaim (kubectl get pvc)
PersistentVolume (kubectl get pv)
These will, for example, be deleted if you do just a helm uninstall instead of kubectl delete statefulset/<item>. Unless there is some other reference to the volumes, and unless you have backups of the previous YAMLs that contain the IDs (i.e. not just the ones generated from Helm templates, but the ones from the orchestrator), you might have a painful day ahead of you.
PVCs and PVs hold IDs and other reference properties for the underlying (probably/mostly?) vendor-specific volume, e.g. S3 or another object or file storage implementation used in the background as a volume in a Pod or other resources.
Deleting or otherwise modifying a StatefulSet does not affect mounting of the correct resource, as long as you preserve the PVC name within the spec.
If in doubt, always copy the whole volume locally before doing anything destructive to PVCs and PVs you may need in the future, or before running commands without knowing the underlying source code, e.g. by:
kubectl cp <some-namespace>/<some-pod>:/var/lib/something /tmp/backup-something
and then just load it back by reversing the arguments.
Also, for Helm usage: delete the StatefulSet, then issue a helm upgrade command and it will recreate the missing StatefulSet without touching PVCs and PVs.
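A sketch of that Helm flow (the StatefulSet, release and chart names are placeholders):
kubectl delete statefulset <name> --cascade=orphan   # pods, PVCs and PVs stay in place
helm upgrade <release-name> <chart>                  # recreates the missing StatefulSet from the chart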

Volume is already exclusively attached to one node and can't be attached to another

I have a pretty simple Kubernetes pod. I want a stateful set and want the following process:
I want to have an initcontainer download and uncompress a tarball from s3 into a volume mounted to the initcontainer
I want to mount that volume to my main container to be used
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app
  namespace: test
  labels:
    name: app
spec:
  serviceName: app
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      initContainers:
        - name: preparing
          image: alpine:3.8
          imagePullPolicy: IfNotPresent
          command:
            - "sh"
            - "-c"
            - |
              echo "Downloading data"
              wget https://s3.amazonaws.com/.........
              tar -xvzf xxxx-........ -C /root/
          volumeMounts:
            - name: node-volume
              mountPath: /root/data/
      containers:
        - name: main-container
          image: ecr.us-west-2.amazonaws.com/image/:latest
          imagePullPolicy: Always
          volumeMounts:
            - name: node-volume
              mountPath: /root/data/
  volumeClaimTemplates:
    - metadata:
        name: node-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: gp2-b
        resources:
          requests:
            storage: 80Gi
I continue to get the following error:
At first I run this and I can see the logs of my tarball being downloaded by the initContainer. About halfway through, it terminates and gives me the following error:
Multi-Attach error for volume "pvc-faedc8" Volume is already exclusively attached to one node and can't be attached to another
Looks like you have a dangling PVC and/or PV that is attached to one of your nodes. You can ssh into the node and run a df or mount to check.
If you look at this, the PVCs in a StatefulSet are always mapped to their pod names, so it may be possible that you still have a dangling pod(?)
If you have a dangling pod:
$ kubectl -n test delete pod <pod-name>
You may have to force it:
$ kubectl -n test delete pod <pod-name> --grace-period=0 --force
Then, you can try deleting the PVC and its corresponding PV:
$ kubectl delete pvc pvc-faedc8
$ kubectl delete pv <pv-name>
I had the same issue just now, and the problem was that the node the pod usually runs on was down and another one took over (which didn't work as expected for whatever reason). I'd had the "node down" scenario a few times before and it never caused any issues. I couldn't get the StatefulSet and Deployment back up and running without booting the node that was down, but as soon as that node was up and running again, the StatefulSet and Deployment immediately came back to life as well.
I had a similar error:
The volume pvc-2885ea01-f4fb-11eb-9528-00505698bd8b cannot be attached to the node node1 since it is already attached to the node node2
I use Longhorn as the storage provisioner and manager, so I just detached the PV named in the error and restarted the StatefulSet. It was automatically able to attach to the PV correctly this time.
I'll add an answer that will prevent this from happening again.
Short answer
Access modes: Switch from ReadWriteOnce to ReadWriteMany.
In a bit more details
You're using a StatefulSet where each replica has its own state, with a unique persistent volume claim (PVC) created for each pod.
Each PVC is referring to a Persistent Volume where you decided that the access mode is ReadWriteOnce.
Which as you can see from here:
ReadWriteOnce
the volume can be mounted as read-write by a single node. ReadWriteOnce access mode still can allow multiple pods to access the volume when the pods are running on the same node.
So if the K8s scheduler (due to priorities or resource calculations, or a Cluster Autoscaler deciding to shift the pod to a different node) places the pod on another node, you will receive an error that the volume is already exclusively attached to one node and can't be attached to another.
Please consider using ReadWriteMany where the volume can be mounted as read-write by many nodes.
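A sketch of that change in the volumeClaimTemplates from the question - note that gp2 (EBS) cannot actually serve ReadWriteMany, so the storage class would also have to be swapped for an RWX-capable one (the efs-sc name below is only an assumption):
volumeClaimTemplates:
  - metadata:
      name: node-volume
    spec:
      accessModes: [ "ReadWriteMany" ]   # instead of ReadWriteOnce
      storageClassName: efs-sc           # assumption: some RWX-capable class; plain gp2/EBS won't work
      resources:
        requests:
          storage: 80Gi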

Kubernetes trouble with StatefulSet and 3 PersistentVolumes

I'm in the process of creating a StatefulSet based on this yaml, that will have 3 replicas. I want each of the 3 pods to connect to a different PersistentVolume.
For the persistent volume I'm using 3 objects that look like this, with only the name changed (pvvolume, pvvolume2, pvvolume3):
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pvvolume
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/nfs"
  claimRef:
    kind: PersistentVolumeClaim
    namespace: default
    name: mongo-persistent-storage-mongo-0
The first of the 3 pods in the StatefulSet seems to be created without issue.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container.
Yet if I go to the tab showing PersistentVolumeClaims the second one that was created seems to have been successful.
If it was successful why does the pod think it failed?
I want each of the 3 pods to connect to a different PersistentVolume.
For that to work properly you will either need:
a provisioner (in the link you posted there are examples of how to set up a provisioner on AWS, Azure, Google Cloud and minikube), or
a volume capable of being mounted multiple times (such as an NFS volume). Note however that in such a case all your pods read/write to the same folder, and this can lead to issues when they are not meant to lock/write to the same data concurrently. The usual use case for this is an upload folder that pods save to, which is later only read from, and similar use cases. SQL databases (such as MySQL), on the other hand, are not meant to write to such a shared folder.
Instead of either of the mentioned requirements, your volume manifest is using hostPath (pointing to /nfs) and is set to ReadWriteOnce (only one can use it). You are also using 'standard' as the storage class, and in the URL you gave there are fast and slow ones, so you probably created your storage class as well.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container
That is because the first pod already took its claim (ReadWriteOnce, hostPath) and the second pod can't reuse the same one if a proper provisioner or access mode is not set up.
If it was successful why does the pod think it failed?
All PVCs were successfully bound to an accompanying PV. But you are never binding the second and third PVCs to the second or third pods. You are retrying with the first claim on the second pod, and the first claim is already bound (to the first pod) in ReadWriteOnce mode, so it can't be bound to the second pod as well, and you get the error...
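To illustrate that last point: if you did want to keep the manual pre-binding approach, each PV would need its own claimRef, matching the claim name the StatefulSet generates per replica. A sketch (assuming the StatefulSet from the linked yaml is named mongo and its volumeClaimTemplate is named mongo-persistent-storage; the per-volume hostPath is also an assumption):
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pvvolume2
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/nfs2"                               # assumption: a separate directory per volume
  claimRef:
    kind: PersistentVolumeClaim
    namespace: default
    name: mongo-persistent-storage-mongo-1      # the claim generated for the second replica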
Suggested approach
Since you reference /nfs as your host path, it may be safe to assume that you are using some kind of NFS-backed file system, so here is one alternative setup that lets you mount dynamically provisioned persistent volumes over NFS to as many pods in the StatefulSet as you want.
Notes:
This only answers the original question of mounting persistent volumes across StatefulSet replicated pods, under the assumption of NFS sharing.
NFS is not really advisable for dynamic data such as databases. The usual use case is an upload folder or a moderate logging/backup folder. A database (SQL or NoSQL) is usually a no-no on NFS.
For mission/time-critical applications you might want to time/stress-test carefully before taking this approach in production, since both k8s and the external PV add some layers/latency in between. For some applications this might suffice, but be warned.
You have limited control over the names of dynamically created PVs (k8s adds a suffix to newly created ones, and reuses available old ones if told to do so), but k8s will keep them after a pod gets terminated and assign the first available one to a new pod, so you won't lose state/data. This is something you can control with policies, though.
Steps:
For this to work you will first need to install the NFS provisioner from here:
https://github.com/kubernetes-incubator/external-storage/tree/master/nfs. Mind you, the installation is not complicated, but it has some steps where you have to take a careful approach (permissions, setting up NFS shares, etc.), so it is not just a fire-and-forget deployment. Take your time installing the NFS provisioner correctly. Once this is properly set up, you can continue with the suggested manifests below:
Storage class manifest:
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: sc-nfs-persistent-volume
# if you changed this during provisioner installation, update also here
provisioner: example.com/nfs
Stateful Set (important excerpt only):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ss-my-app
spec:
  replicas: 3
  ...
  selector:
    matchLabels:
      app: my-app
      tier: my-mongo-db
  ...
  template:
    metadata:
      labels:
        app: my-app
        tier: my-mongo-db
    spec:
      ...
      containers:
        - image: ...
          ...
          volumeMounts:
            - name: persistent-storage-mount
              mountPath: /wherever/on/container/you/want/it/mounted
          ...
      ...
  volumeClaimTemplates:
    - metadata:
        name: persistent-storage-mount
      spec:
        storageClassName: sc-nfs-persistent-volume
        accessModes: [ ReadWriteOnce ]
        resources:
          requests:
            storage: 10Gi
  ...

Run a pod without scheduler in Kubernetes

I searched the documentation but I am unable to find out if I can run a pod in Kubernetes without the scheduler. If anyone can help with any pointers, it would be helpful.
Update:
I can attach a label to a node and make the pod stick to that label, but that would involve going through the scheduler. Is there any method that uses neither a DaemonSet nor the scheduler?
The scheduler just sets the spec.nodeName field on the pod. You can set that to a node name yourself if you know which node you want to run your pod, though you are then responsible for ensuring the node has sufficient resources to run the pod (enough memory, free host ports, etc… all things the scheduler is normally responsible for checking before it assigns a pod to a node)
You want static pods
Static pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. They don't have any associated replication controller; the kubelet daemon itself watches them and restarts them when they crash.
You can simply add a nodeName attribute to the pod definition; this field is normally filled in by the scheduler, so it is not a mandatory field for you to set.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - image: nginx
      name: nginx
  nodeName: node01
If the pod has already been created and is in the Pending state, you have to recreate it with the new field; editing the nodeName attribute on an existing pod is not permitted.
All the answers given here would require a scheduler to run.
I think what you want to do is create the manifest file of the pod and put it in the default manifest directory of the node in question.
Default directory is /etc/kubernetes/manifests/
The pod will automatically be created and if you wish to delete it, just delete the manifest file.
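A sketch of that flow on the node itself, assuming the kubelet uses the common default staticPodPath of /etc/kubernetes/manifests (as kubeadm clusters do):
# run on the target node, not through the API server
cat <<EOF | sudo tee /etc/kubernetes/manifests/nginx-static.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-static
spec:
  containers:
    - name: nginx
      image: nginx
EOF
# the kubelet creates the pod on its own; deleting the file removes the pod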
You can simply add a nodeName attribute to the pod definition
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeName: controlplane
  containers:
    - image: nginx
      name: nginx
Now, an important point: check the nodes listed by the command below, and then assign the pod to one of them:
kubectl get nodes