Kubernetes persistent volume data corrupted after multiple pod deletions

I am struggling with a simple one-replica deployment of the official Event Store image on a Kubernetes cluster. I am using a persistent volume for the data storage.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-eventstore
spec:
  strategy:
    type: Recreate
  replicas: 1
  template:
    metadata:
      labels:
        app: my-eventstore
    spec:
      imagePullSecrets:
      - name: runner-gitlab-account
      containers:
      - name: eventstore
        image: eventstore/eventstore
        env:
        - name: EVENTSTORE_DB
          value: "/usr/data/eventstore/data"
        - name: EVENTSTORE_LOG
          value: "/usr/data/eventstore/log"
        ports:
        - containerPort: 2113
        - containerPort: 2114
        - containerPort: 1111
        - containerPort: 1112
        volumeMounts:
        - name: eventstore-storage
          mountPath: /usr/data/eventstore
      volumes:
      - name: eventstore-storage
        persistentVolumeClaim:
          claimName: eventstore-pv-claim
And this is the yaml for my persistent volume claim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: eventstore-pv-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
The deployment works fine. It's when I tested for durability that I started to encounter a problem: I delete a pod to force the actual state to diverge from the desired state and see how Kubernetes reacts.
It immediately launched a new pod to replace the deleted one, and the admin UI still showed the same data. But after deleting a pod for the second time, the new pod did not come up. I got a "record too large" error, which indicates corrupted data according to this discussion: https://groups.google.com/forum/#!topic/event-store/gUKLaxZj4gw
I tried again a couple of times, with the same result every time: after deleting the pod a second time, the data is corrupted. This has me worried that an actual failure would cause a similar result.
However, when deploying new versions of the image, or scaling the deployment to zero and back to one, no data corruption occurs; after several tries everything is fine. Which is odd, since that also completely replaces pods (I checked the pod IDs and they changed).
This has me wondering whether deleting a pod using kubectl delete is somehow more forceful in the way the pod is terminated. Do any of you have similar experience, or insights on whether/how delete is different? Thanks in advance for your input.
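For reference, this is the plain delete I am running, next to the forced variant (which I am not using); the pod name is a placeholder:
# Default delete: sends SIGTERM, then SIGKILL after the grace period (30s by default).
kubectl delete pod <pod-name>
# Forced delete: skips the graceful shutdown entirely.
kubectl delete pod <pod-name> --grace-period=0 --force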
Regards,
Oskar

I was referred to this pull request on GitHub, which stated that the process was not being killed properly: https://github.com/EventStore/eventstore-docker/pull/52
After building a new image with the Dockerfile from the pull request, I put this image in the deployment. I am now killing pods left and right with no data corruption issues.
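On top of that (my own addition, not part of the linked pull request), it may help to give the container a longer termination grace period, so a clean shutdown can finish once signals are actually forwarded. A partial sketch against the deployment above; the value of 120 is arbitrary:
spec:
  template:
    spec:
      # Default is 30 seconds; give Event Store more time to flush before SIGKILL.
      terminationGracePeriodSeconds: 120
      containers:
      - name: eventstore
        image: eventstore/eventstore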
Hope this helps someone facing the same issue.

Related

Why is my Host Path Persistent Volume reachable from all pods?

I'm pretty stuck on this step of learning Kubernetes: PVs and PVCs.
What I'm trying to do here is understand how to handle a shared read-write volume across multiple pods.
What I understood is that a PVC cannot be shared between pods unless an NFS-like storage class has been configured.
I'm still using my hostPath storage class, and I tried the following (on Docker Desktop and on a 3-node MicroK8s cluster):
This PVC with dynamic hostPath provisioning:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-desktop
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Mi
A Deployment with 3 replicated pods writing to the same PVC:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 3
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox
        image: library/busybox:stable
        command: ["/bin/sh"]
        args: ["-c", 'while true; do echo "1: $(hostname)" >> /root/index.html; sleep 2; done;']
        volumeMounts:
        - mountPath: /root
          name: vol-desktop
      volumes:
      - name: vol-desktop
        persistentVolumeClaim:
          claimName: pvc-desktop
An Nginx server serving the volume content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:stable
        volumeMounts:
        - mountPath: /usr/share/nginx/html
          name: vol-desktop
        ports:
        - containerPort: 80
      volumes:
      - name: vol-desktop
        persistentVolumeClaim:
          claimName: pvc-desktop
From what I understood of the documentation this should not be possible, but in reality everything ran pretty smoothly and my Nginx server served the up-to-date index.html file just fine.
It actually worked on both a single-node cluster and a multi-node cluster.
What am I not getting here? Why does this work?
Is every pod mounting its own hostPath volume on start?
How can hostPath storage work between multiple nodes?
EDIT: For the multi-node case, a network share had been set up between the same storage path on each machine, which is why everything was replicated successfully. I hadn't understood that the same host path is created on each node that mounts the PVC.
To anyone with the same problem: each node mounting this hostPath PVC will have its own folder created at the PV path.
So without network replication between nodes, only pods on the same node will share the same folder.
This is why it's discouraged on a multi-node cluster: the location of a pod on the cluster is unpredictable.
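You can check which node each pod landed on (and therefore which folder it is actually writing to) with:
# Shows the node for every pod; pods listed on the same node share the same hostPath folder.
kubectl get pods -o wide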
Thanks!
how to handle a shared read-write volume across multiple pods.
Redesign your application to avoid it. Sharing a read-write volume tends to be fragile, and it is difficult to manage multiple writers safely; you depend on your application correctly performing things like file locking, on the underlying shared filesystem implementation handling things properly, and on the system being tolerant of any sort of network hiccup that might happen.
The example you give is something that frequently appears in Docker Compose setups: have an application with a mix of backend code and static files, and then try to publish the static files at runtime through a volume to a reverse proxy. Instead, you can build an image that copies the static files at build time:
FROM nginx
ARG app_version=latest
COPY --from=my/app:${app_version} /app/static /usr/share/nginx/html
Have your CI system build this and push it immediately after the backend image is built. The resulting image serves the corresponding static files, but doesn't require a shared volume or any manual management of the volume contents.
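A sketch of what that CI step could look like; the registry, image names, tag, and build-arg value here are placeholders, not something from your setup:
# Build the static-files image against the backend image that was just pushed, then push it too.
docker build --build-arg app_version=1.2.3 -t registry.example.com/my/app-static:1.2.3 .
docker push registry.example.com/my/app-static:1.2.3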
For other types of content, consider storing data in a database, or use an object-storage service that maintains its own backing store and can handle the concurrency considerations. Then most of your pods can be totally stateless, and you can manage the data separately (maybe even outside Kubernetes).
How can hostPath storage work between multiple nodes?
It doesn't. It's an instruction to Kubernetes, on whichever node the pod happens to be scheduled on, to mount that host directory into the container. There's no management of any sort of the directory content; if two pods get scheduled on the same node, they'll share the directory, and if not, they won't; and if your pod's Deployment is updated and the pod is deleted and recreated somewhere else, it might not be the same node and might not have the same data.
With some very specific exceptions you shouldn't use hostPath volumes at all. The exceptions are things like log collectors run as DaemonSets, where there is exactly one pod on every node and you're interested in picking up the host-directory content that is different on each node.
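For illustration, a minimal sketch of that DaemonSet pattern; the image, labels, and paths are made up for the example:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
      - name: collector
        image: library/busybox:stable
        # Reads this node's own log directory; the content differs per node, which is the point.
        command: ["/bin/sh", "-c", "tail -F /var/log/host/messages"]
        volumeMounts:
        - name: varlog
          mountPath: /var/log/host
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log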
In your specific setup either you're getting lucky with where the data producers and consumers are getting colocated, or there's something about your MicroK8s setup that's causing the host directories to be shared. It is not in general reliable storage.

Kubernetes provisioning GCE Persistent disk sometimes fails

I'm currently using the GCE standard container cluster with a lot of success and pleasure. But I have a question about the provisioning of GCE Persistent Disks.
As described in this document from Kubernetes, I created two YAML files:
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  name: slow
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
and
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
If I now create the following volume claim:
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "claim-test",
    "annotations": {
      "volume.beta.kubernetes.io/storage-class": "hdd"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "3Gi"
      }
    }
  }
}
The disk gets created perfectly!
And if I now start the following unit:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    role: nfs-server
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - name: nfs-server
        image: gcr.io/google_containers/volume-nfs
        ports:
        - name: nfs
          containerPort: 2049
        - name: mountd
          containerPort: 20048
        - name: rpcbind
          containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /exports
          name: mypvc
      volumes:
      - name: mypvc
        persistentVolumeClaim:
          claimName: claim-test
The disk gets mounted perfectly, but many times I stumble upon the following error (nothing more can be found in the kubelet.log file):
Failed to attach volume "claim-test" on node "...." with: GCE persistent disk not found: diskName="....." zone="europe-west1-b"
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "....". list of unattached/unmounted volumes=[....]
Sometimes the pod boots perfectly, but sometimes it crashes. The only thing I could find is that there needs to be enough time between creating the PVC and creating the RC itself. I tried this many times, but with the same uncertain results.
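For what it's worth, one way to double-check that the claim is actually Bound (and the persistent disk created) before starting the controller:
# The STATUS column should read Bound before the ReplicationController is created.
kubectl get pvc claim-test
kubectl describe pvc claim-test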
I hope someone can give me some kind of suggestion or help.
Thanks in advance!
Best regards,
Hacor
Thanks in advance for your comments! After a few days of searching I was finally able to determine what the problem was; I'm posting it because it may be useful to other users.
I was using the NFS example for Kubernetes as a replication controller to provide my apps with NFS storage, but it seems that when the NFS server and the PV/PVC get deleted, the NFS share sometimes gets stuck on the node itself. I think this is because I didn't delete these elements in a particular order, so the node got stuck with the share and became incapable of mounting new shares to itself or to the pod!
I noticed that the problem always occurred after I deleted some app (NFS, PV, PVC and other components) from the cluster. On a freshly created GCE cluster, creating apps works perfectly, until I delete one and then things go wrong...
I don't know the correct deletion order for sure, but I think it is:
Pods using the NFS share
PV, PVC of the NFS share
NFS server
If the pod takes longer to delete and isn't completely gone before the PV is deleted, the node hangs with a mount it can't remove because it's in use, and that's where the problems occur.
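In kubectl terms, that order might look something like this (the resource names are illustrative; only nfs-server matches the controller above):
kubectl delete pod -l role=uses-nfs      # 1. pods using the NFS share
kubectl delete pvc nfs-claim             # 2. the PVC of the NFS share
kubectl delete pv nfs-pv                 #    and its PV
kubectl delete rc nfs-server             # 3. the NFS server itself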
I must honestly say that now I'm moving to an externally provisioned GlusterFS cluster.
Hope it helps someone!
Regards,
Hacor

Kubernetes rolling update in case of secret update

I have a Replication Controller with one replica using a secret. How can I update or recreate its (lone) pod, without downtime, with the latest secret value when the secret is changed?
My current workaround is increasing the number of replicas in the Replication Controller, deleting the old pods, and changing the replica count back to its original value.
Is there a command or flag to induce a rolling update retaining the same container image and tag? When I try to do so, it rejects my attempt with the following message:
error: Specified --image must be distinct from existing container image
A couple of issues, #9043 and #13488, describe the problem reasonably well, and I suspect a rolling-update approach will eventuate at some point (like most things in Kubernetes), though it's unlikely for 1.3.0. The same issue applies to updating ConfigMaps.
Kubernetes will do a rolling update whenever anything in the deployment's pod spec is changed (typically the image, to a new version), so one suggested workaround is to set an environment variable in your deployment's pod spec (e.g. RESTART_).
Then when you've updated your secret/configmap, bump the env value in your deployment (via kubectl apply, or patch, or edit), and Kubernetes will start a rolling update of your deployment.
Example Deployment spec:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: test-nginx
    spec:
      containers:
      - name: nginx
        image: "nginx:stable"
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /etc/nginx/conf.d
          name: config
          readOnly: true
        - mountPath: /etc/nginx/auth
          name: tokens
          readOnly: true
        env:
        - name: RESTART_
          value: "13"
      volumes:
      - name: config
        configMap:
          name: test-nginx-config
      - name: tokens
        secret:
          secretName: test-nginx-tokens
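For example, bumping the variable with a strategic merge patch (deployment and container names taken from the spec above; the new value is arbitrary) will trigger the rolling update:
kubectl patch deployment test-nginx -p '{"spec":{"template":{"spec":{"containers":[{"name":"nginx","env":[{"name":"RESTART_","value":"14"}]}]}}}}'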
Two tips:
your environment variable name can't start with an _ or it magically disappears somehow.
if you use a number for your restart variable you need to wrap it in quotes
If I understand correctly, Deployment should be what you want.
Deployment supports rolling update for almost all fields in the pod template.
See http://kubernetes.io/docs/user-guide/deployments/

Restart a Successful/Failed pod manually

Running Kubernetes v1.2.2 on CoreOS on VMware:
I have a pod with the restart policy set to Never. Is it possible to manually start the same pod back up?
In my use case we will have a PostgreSQL instance in this pod. If it were to crash, I would like to leave the pod in a failed state until we can look at it more closely to see why it failed, and then start it manually, rather than have it restart automatically with a restartPolicy of Always.
Looking through kubectl, it doesn't seem like there is a manual start option. I could delete and recreate the pod, but I think this would remove the data from my container. Maybe I should be mounting a local volume on my host, so that I don't need to worry about losing data?
This is my sample pod YAML. I don't seem to be able to restart the 'health' pod.
apiVersion: v1
kind: Pod
metadata:
  name: health
  labels:
    environment: dev
    app: health
spec:
  containers:
  - image: busybox
    command:
    - sleep
    - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
  restartPolicy: Never
One simple method that might address your needs is to add a unique instance label, maybe a simple counter. If each pod is labelled differently you can start as many as you like and keep around as many failed instances as you like.
e.g. first pod
apiVersion: v1
kind: Pod
metadata:
  name: health-0
  labels:
    environment: dev
    app: health
    instance: "0"
spec:
  containers: ...
second pod
apiVersion: v1
kind: Pod
metadata:
  name: health-1
  labels:
    environment: dev
    app: health
    instance: "1"
spec:
  containers: ...
Based on your question and comments, it sounds like you want to restart a failed container while retaining its state and data. In fact, application containers and pods are considered to be relatively ephemeral (rather than durable) entities: when a container crashes, its files will be lost and the kubelet will restart it with a clean state.
To retain your data and logs, use persistent volume types in your deployment. This will let you preserve data across container restarts.
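A minimal sketch of that idea, assuming a PostgreSQL container; the claim name, pod name, image tag, and password below are placeholders, not your actual setup:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: health-postgres
spec:
  restartPolicy: Never
  containers:
  - name: postgres
    image: postgres:9.5
    env:
    - name: POSTGRES_PASSWORD
      value: "changeme"   # placeholder only
    volumeMounts:
    # PostgreSQL's data directory lives on the claim, so it survives pod recreation.
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pgdata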

Unable to mount Amazon Web Services (AWS) EBS Volume into Kubernetes pod

I created a volume using the following command.
aws ec2 create-volume --size 10 --region us-east-1 --availability-zone us-east-1c --volume-type gp2
Then I used the file below to create a pod that uses the volume. But when I log in to the pod, I don't see the volume. Is there something I might be doing wrong? Did I miss a step somewhere? Thanks for any insights.
---
kind: "Pod"
apiVersion: "v1"
metadata:
  name: "nginx"
  labels:
    name: "nginx"
spec:
  containers:
  - name: "nginx"
    image: "nginx"
    volumeMounts:
    - mountPath: /test-ebs
      name: test-volume
  volumes:
  - name: test-volume
    # This AWS EBS volume must already exist.
    awsElasticBlockStore:
      volumeID: aws://us-east-1c/vol-8499707e
      fsType: ext4
I just stumbled across the same thing and found out after some digging that they actually changed the volume mount syntax. Based on that knowledge I created this PR to update the documentation. See https://github.com/kubernetes/kubernetes/pull/17958 for tracking that and more info; follow the link to the bug and the original change, which doesn't include the doc update. (SO prevents me from posting more than two links, apparently.)
If that still doesn't do the trick for you (it does for me), it's probably because of https://stackoverflow.com/a/32960312/3212182, which I guess will be fixed in one of the next releases. At least I can't see it in the latest release notes.
Ensure that the volume is in the same availability zone as the node.
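A quick way to compare the two (the volume ID is the one from the manifest above; the exact zone label key on the node depends on your Kubernetes version):
# The volume's availability zone must match the zone of the node the pod is scheduled on.
aws ec2 describe-volumes --volume-ids vol-8499707e --query 'Volumes[0].AvailabilityZone'
kubectl get nodes --show-labels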
http://kubernetes.io/docs/user-guide/volumes/
If that's not the issue, are there any events in kubectl describe pod nginx?