Why is my Host Path Persistent Volume reachable from all pods? - kubernetes

I'm pretty stuck on this step of learning Kubernetes: PVs and PVCs.
What I'm trying to do here is understand how to handle a shared read-write volume across multiple pods.
What I understood is that a PVC cannot be shared between pods unless an NFS-like storage class has been configured.
I'm still using the hostPath storage class, and I tried the following (on Docker Desktop and on a 3-node MicroK8s cluster):
This PVC with dynamic hostPath provisioning:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-desktop
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Mi
A Deployment with 3 replicated pods writing to the same PVC:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 3
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - name: busybox
          image: library/busybox:stable
          command: ["/bin/sh"]
          args: ["-c", 'while true; do echo "1: $(hostname)" >> /root/index.html; sleep 2; done;']
          volumeMounts:
            - mountPath: /root
              name: vol-desktop
      volumes:
        - name: vol-desktop
          persistentVolumeClaim:
            claimName: pvc-desktop
An Nginx server serving the volume content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:stable
          volumeMounts:
            - mountPath: /usr/share/nginx/html
              name: vol-desktop
          ports:
            - containerPort: 80
      volumes:
        - name: vol-desktop
          persistentVolumeClaim:
            claimName: pvc-desktop
From what I understood of the documentation, this should not be possible, but in reality everything runs pretty smoothly and my Nginx server serves the up-to-date index.html just fine.
It actually worked on both the single-node and the multi-node cluster.
What am I not getting here? Why does this work?
Is every pod mounting its own hostPath volume on start?
How can hostPath storage work across multiple nodes?
EDIT: In the multi-node case, a network share had been set up between the same storage path on each machine; that is why everything was replicated successfully. I hadn't understood that the same host path is created on each node that mounts this PVC.
To anyone with the same problem: each node mounting this hostPath PVC will have its own folder created at the PV path.
So without network replication between nodes, only pods on the same node will share the same folder.
This is why it's discouraged on a multi-node cluster: the location of a pod on the cluster is unpredictable.
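A rough way to see this for yourself (pod names below are placeholders) is to check which node each replica landed on and compare what each pod sees at the mount path:
kubectl get pods -l app=busybox -o wide
# pick two pods running on different nodes
kubectl exec <pod-on-node-a> -- cat /root/index.html
kubectl exec <pod-on-node-b> -- cat /root/index.html
Without any network replication, each file only contains the hostnames of the pods running on that same node.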
Thanks!

how to handle a shared read-write volume across multiple pods.
Redesign your application to avoid it. Multiple writers tend to be fragile and difficult to manage safely; you depend on your application correctly performing things like file locking, on the underlying shared filesystem implementation handling things properly, and on the system being tolerant of any sort of network hiccup that might happen.
The example you give is something that frequently appears in Docker Compose setups: have an application with a mix of backend code and static files, and then try to publish the static files at runtime through a volume to a reverse proxy. Instead, you can build an image that copies the static files at build time:
FROM nginx
ARG app_version=latest
# pull the prebuilt static assets out of the application image at build time
COPY --from=my/app:${app_version} /app/static /usr/share/nginx/html
Have your CI system build this and push it immediately after the backend image is built. The resulting image serves the corresponding static files, but doesn't require a shared volume or any manual management of the volume contents.
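For example, the CI step could be as simple as this (image names, tag, and registry are placeholders):
# rebuild the static-file image against the freshly built backend image
docker build --build-arg app_version=1.2.3 -t registry.example.com/my/app-static:1.2.3 .
docker push registry.example.com/my/app-static:1.2.3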
For other types of content, consider storing data in a database, or use an object-storage service that maintains its own backing store and can handle the concurrency considerations. Then most of your pods can be totally stateless, and you can manage the data separately (maybe even outside Kubernetes).
How can hostPath storage work across multiple nodes?
It doesn't. It's an instruction to Kubernetes, on whichever node the pod happens to be scheduled on, to mount that host directory into the container. There's no management of any sort of the directory content; if two pods get scheduled on the same node, they'll share the directory, and if not, they won't; and if your pod's Deployment is updated and the pod is deleted and recreated somewhere else, it might not be the same node and might not have the same data.
With some very specific exceptions you shouldn't use hostPath volumes at all. The exceptions are things like log collectors run as DaemonSets, where there is exactly one pod on every node and you're interested in picking up the host-directory content that is different on each node.
In your specific setup either you're getting lucky with where the data producers and consumers are getting colocated, or there's something about your MicroK8s setup that's causing the host directories to be shared. It is not in general reliable storage.
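A quick way to check which case you're in is to look at where the pods actually landed:
kubectl get pods -o wide
# the NODE column shows where each busybox writer and the nginx pod were scheduled
If the writers and the nginx pod are on different nodes and you still see shared, up-to-date content, then something outside Kubernetes (a network mount between the nodes, for example) is replicating the directory.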

Related

Monitoring Kubernetes PVC disk usage

I'm trying to monitor Kubernetes PVC disk usage. I need the amount of storage in use by a PersistentVolumeClaim. I found the command:
kubectl get --raw /api/v1/persistentvolumeclaims
It returns:
"status":{
"phase":"Bound",
"accessModes":[
"ReadWriteOnce"
],
"capacity":{
"storage":"1Gi"
}
}
But that only gives me the full capacity of the disk, and as I said, I need the amount used.
Does anyone know which command could return this information to me?
I don't have a definitive answer, but I hope this will help you. Also, I would be interested if someone has a better answer.
Get current usage
The PersistentVolume subsystem provides an API for users and administrators that abstracts details of how storage is provided from how it is consumed.
-- Persistent Volume | Kubernetes
As stated in the Kubernetes documentation, PV (PersistentVolume) and PVC (PersistentVolumeClaim) are abstractions over storage. As such, I do not think you can inspect PV or PVC, but you can inspect the storage medium.
To get the usage, create a debugging pod that uses your PVC and check the usage from inside it. Whether this works may depend on your storage provider.
# volume-size-debugger.yaml
kind: Pod
apiVersion: v1
metadata:
  name: volume-size-debugger
spec:
  volumes:
    - name: debug-pv
      persistentVolumeClaim:
        claimName: <pvc-name>
  containers:
    - name: debugger
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - mountPath: "/data"
          name: debug-pv
Apply the above manifest with kubectl apply -f volume-size-debugger.yaml, and run a shell inside the pod with kubectl exec -it volume-size-debugger -- sh. Inside the shell, run du -sh /data to get the usage in a human-readable format.
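Putting it together, the whole check can be a one-off sequence like this (a sketch only; adjust the PVC name in the manifest first):
kubectl apply -f volume-size-debugger.yaml
kubectl exec volume-size-debugger -- du -sh /data
kubectl exec volume-size-debugger -- df -h /data
kubectl delete pod volume-size-debugger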
Monitoring
As I am sure you have noticed, this is not especially useful for monitoring. It may be useful for a one-time check from time to time, but not for monitoring or low disk space alerts.
One way to set up monitoring would be to have a similar sidecar container like the one above and gather metrics from there. One such example seems to be the node_exporter.
Another way would be to use CSI (Container Storage Interface). I have not used CSI and do not know enough about it to really explain more. But here are a couple of related issues and related Kubernetes documentation:
Monitoring Kubernetes PersistentVolumes - prometheus-operator
Volume stats missing - csi-digitalocean
Storage Capacity | Kubernetes
+1 to touchmarine's answer; however, I'd like to expand on it a bit and add my two cents as well.
But that only gives me the full capacity of the disk, and as I said, I need the amount used.
A PVC is an abstraction which represents a request for storage and simply doesn't store information such as disk usage. As a higher-level abstraction, it doesn't care at all how the underlying storage is used by its consumer.
@touchmarine, instead of using a Pod whose only function is to sleep, and having to attach to it manually every time you need to check the disk usage, I would propose using something like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      volumes:
        - name: media
          persistentVolumeClaim:
            claimName: media
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
          volumeMounts:
            - mountPath: "/data"
              name: media
        - name: busybox
          image: busybox
          command: ["/bin/sh"]
          args: ["-c", "while true; do du -sh /data; sleep 10; done"]
          volumeMounts:
            - mountPath: "/data"
              name: media
It can of course be a single-container busybox Pod as in @touchmarine's example, but here I decided to also show how it can be run as a sidecar next to the nginx container within a single Pod.
As it runs a simple shell script, an infinite while loop that prints the current disk usage to standard output, it can be read with kubectl logs without needing kubectl exec to attach to the Pod:
$ kubectl logs nginx-deployment-56bb5c87f6-dqs5h busybox
20.0K /data
20.0K /data
20.0K /data
I guess it could also be used more effectively to configure some sort of disk-usage monitoring.
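If you already run Prometheus against the kubelet, note that the kubelet exposes per-PVC volume statistics on its own, which may be a better fit for alerting than a logging sidecar (the metric names below are the standard kubelet ones; verify they are present in your setup):
# current usage and capacity for the PVC named "media"
kubelet_volume_stats_used_bytes{persistentvolumeclaim="media"}
kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="media"}
# example alert expression: volume more than 90% full
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9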

How to store my pod logs in a persistent storage?

I can see the logs for my pods using kubectl logs <pod-name>. But I want to persist these logs in a volume (some kind of persistent storage), because container logs get wiped out if the pods go down. Is there a way to do this? Do I have to write some sort of script?
I have read many answers but I still do not understand how to go about it, any help is appreciated. Thanks!
Under Logging Architecture, the Kubernetes documentation goes through a couple of ways to set up logging in your cluster.
The most interesting one for you is probably the cluster-level logging architecture:
While Kubernetes does not provide a native solution for cluster-level
logging, there are several common approaches you can consider. Here
are some options:
Use a node-level logging agent that runs on every node.
Include a dedicated sidecar container for logging in an application pod.
Push logs directly to a backend from within an application
There are many solutions for collecting pod logs and shipping them to a centralized location such as:
fluentd
splunk
elastic
Keeping logs outside the cluster has benefits. If your cluster begins to have issues, it's likely that your in-cluster logging architecture will face them too.
You will need to mount the logs directory inside the container to the host machine as well, using the PersistentVolume and PersistentVolumeClaim.
This way you can persist these logs even if the container is killed.
Create a PersistentVolume and PersistentVolumeClaim for the log path and use them as volume mounts in your Kubernetes Deployments or Pods.
I know this is an old question, but I've just had the same problem and spent some time figuring out the solution, so I'd like to share it in more detail.
Like Aayush Mall said, you'll need the PersistentVolume and PersistentVolumeClaim objects to create the volume and then link it to the pod (preferably via a Deployment object).
Basically, the PersistentVolume defines how and where the volume is stored on the host, and the PersistentVolumeClaim defines the constraints for binding the volume to a container.
From the docs:
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system.
A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see AccessModes).
So, let's say your pods are running in two nodes: mynode-1 and mynode-2.
Your PersistentVolume spec will look like this.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: myapp-log-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/log/myapp
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - mynode-1
                - mynode-2
Your PersistentVolumeClaim will look like this.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-log-pvc
spec:
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: local-storage
  resources:
    requests:
      storage: 2Gi
  volumeName: myapp-log-pv
And then, you just have to tell the deployment object how to mount the volume inside the container. So, your Deployment spec will look like this.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
spec:
  selector:
    matchLabels:
      app: myapp
  replicas: 1
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:latest
          volumeMounts:
            - name: log
              mountPath: /var/log
      volumes:
        - name: log
          persistentVolumeClaim:
            claimName: myapp-log-pvc
And that's it. When your deployment starts, it'll create the pod with the container, mount a volume named log at the path /var/log (inside the container) and bind this volume to some PV matching the requirements of the PVC named myapp-log-pvc. As we've created myapp-log-pv with the same volumeMode, accessModes and storageClassName fields and with more storage capacity than that required by myapp-log-pvc, they will be bound. So, your app logs will be stored in the path /var/log/myapp (the spec.local.path field in the myapp-log-pv spec) on the node running the pod.
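You can confirm the binding worked with something like:
kubectl get pv myapp-log-pv
kubectl get pvc myapp-log-pvc
# both should report STATUS "Bound", and the PVC's VOLUME column should show myapp-log-pv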
I hope it helps :)
Also, I'm kinda new to the Kubernetes world, so please let me know if you notice I've misunderstood something or if there is a better way to do this.

Volume is already exclusively attached to one node and can't be attached to another

I have a pretty simple Kubernetes pod. I want a StatefulSet with the following behavior:
I want an initContainer to download and uncompress a tarball from S3 into a volume mounted to the initContainer.
I want to mount that volume into my main container for it to use.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app
  namespace: test
  labels:
    name: app
spec:
  serviceName: app
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      initContainers:
        - name: preparing
          image: alpine:3.8
          imagePullPolicy: IfNotPresent
          command:
            - "sh"
            - "-c"
            - |
              echo "Downloading data"
              wget https://s3.amazonaws.com/.........
              tar -xvzf xxxx-........ -C /root/
          volumeMounts:
            - name: node-volume
              mountPath: /root/data/
      containers:
        - name: main-container
          image: ecr.us-west-2.amazonaws.com/image/:latest
          imagePullPolicy: Always
          volumeMounts:
            - name: node-volume
              mountPath: /root/data/
  volumeClaimTemplates:
    - metadata:
        name: node-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: gp2-b
        resources:
          requests:
            storage: 80Gi
I keep getting the following error. At first, I run this and I can see the logs flowing as my tarball is downloaded by the initContainer. About halfway through, it terminates and gives me the error:
Multi-Attach error for volume "pvc-faedc8" Volume is already exclusively attached to one node and can't be attached to another
Looks like you have a dangling PVC and/or PV that is attached to one of your nodes. You can ssh into the node and run a df or mount to check.
If you look at this, the PVCs in a StatefulSet are always mapped to their pod names, so it may be possible that you still have a dangling pod(?)
If you have a dangling pod:
$ kubectl -n test delete pod <pod-name>
You may have to force it:
$ kubectl -n test delete pod <pod-name> --grace-period=0 --force
Then, you can try deleting the PVC and its corresponding PV:
$ kubectl delete pvc pvc-faedc8
$ kubectl delete pv <pv-name>
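If your storage is CSI-based, you can also check which node the volume is currently attached to before deleting anything (the attachment names will differ in your cluster):
$ kubectl get volumeattachments
$ kubectl describe volumeattachment <attachment-name>   # shows the node it is attached to and the PV it refers to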
I had the same issue just now, and the problem was that the node the pod usually runs on was down and another one took over (which didn't work as expected for whatever reason). I'd had the "node down" scenario a few times before and it never caused any issues. I couldn't get the StatefulSet and Deployment back up and running without booting the node that was down, but as soon as that node was up again, the StatefulSet and Deployment immediately came back to life as well.
I had a similar error:
The volume pvc-2885ea01-f4fb-11eb-9528-00505698bd8b cannot be attached to the node node1 since it is already attached to the node node2
I use Longhorn as the storage provisioner and manager, so I just detached the PV from the error message and restarted the StatefulSet. This time it was able to attach to the PV correctly.
I'll add an answer that will prevent this from happening again.
Short answer
Access modes: Switch from ReadWriteOnce to ReadWriteMany.
In a bit more detail
You're using a StatefulSet where each replica has its own state, with a unique PersistentVolumeClaim (PVC) created for each pod.
Each PVC refers to a PersistentVolume for which you decided that the access mode is ReadWriteOnce.
Which as you can see from here:
ReadWriteOnce
the volume can be mounted as read-write by a single
node. ReadWriteOnce access mode still can allow multiple pods to
access the volume when the pods are running on the same node.
So in case the K8s scheduler (due to priorities or resource calculations, or due to a cluster autoscaler) decides to shift the pod to a different node, you will receive an error that the volume is already exclusively attached to one node and can't be attached to another.
Please consider using ReadWriteMany where the volume can be mounted as read-write by many nodes.
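Keep in mind this only works if the underlying storage actually supports it (NFS, CephFS and similar do; a single classic block device such as an EBS or GCE disk generally does not). In the volumeClaimTemplates from the question it would look roughly like this, with the storage class swapped for an RWX-capable one (placeholder name below):
volumeClaimTemplates:
  - metadata:
      name: node-volume
    spec:
      accessModes: [ "ReadWriteMany" ]
      storageClassName: <rwx-capable-storage-class>
      resources:
        requests:
          storage: 80Gi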

How to write to gcePersistentDisk while mounted to multiple Kubernetes pods

I'm currently mounting a gcePersistentDisk to each pod in my kubernetes deployment. Since I want multiple pods to read from the disk, I have to mount it as read only. My deployment yaml file looks like this:
apiVersion: extensions/v1beta1
kind: Deployment
spec:
  replicas: 1
  ...
  ...
  template:
    ...
    ...
    spec:
      containers:
        - image: ...
          ...
          ...
          volumeMounts:
            - mountPath: /my-volume
              name: my-volume
              readOnly: true
          ...
          ...
      volumes:
        - name: my-volume
          gcePersistentDisk:
            pdName: my-disk
            fsType: ext4
            readOnly: true
Right now, in order to write new stuff to the disk, I need to scale the deployment to 0, then start a Kubernetes Job that mounts the disk in a single pod with read/write access, writes to the disk, and then scale the deployment up again.
Is there a way I can do this without taking down all my pods?
Is it possible/recommended to do something like "hot-swapping" persistent disks in kubernetes deployments?
Looking at the requirements:
1. With the current use case there is no other choice; the pods need to be scaled down every time.
2. You can use a different type of PV and then use the ReadWriteMany access mode [1] & [2].
3. Hot-swap: meaning changing the deployment (kubectl apply)? Not sure, this needs clarification.
4. Another option is to use NFS [2], but that is obviously a whole different approach (see the sketch after the links below).
[1] https://cloud.google.com/kubernetes-engine/docs/concepts/persistent-volumes#access_modes
[2] Access Modes https://kubernetes.io/docs/concepts/storage/persistent-volumes/
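A minimal sketch of the NFS option mentioned in 4) (server address and export path are placeholders): the pod spec mounts the export directly, and because NFS supports ReadWriteMany, the writer Job and the readers can use it at the same time:
volumes:
  - name: my-volume
    nfs:
      server: <nfs-server-address>
      path: /exports/my-volume
      readOnly: false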

Kubernetes trouble with StatefulSet and 3 PersistentVolumes

I'm in the process of creating a StatefulSet based on this YAML; it will have 3 replicas. I want each of the 3 pods to connect to a different PersistentVolume.
For the persistent volumes I'm using 3 objects that look like this, with only the name changed (pvvolume, pvvolume2, pvvolume3):
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pvvolume
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/nfs"
  claimRef:
    kind: PersistentVolumeClaim
    namespace: default
    name: mongo-persistent-storage-mongo-0
The first of the 3 pods in the StatefulSet seems to be created without issue.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container.
Yet if I go to the tab showing PersistentVolumeClaims the second one that was created seems to have been successful.
If it was successful why does the pod think it failed?
I want each of the 3 pods to connect to a different PersistentVolume.
For that to work properly you will either need:
a provisioner (the link you posted has examples of how to set up a provisioner on AWS, Azure, Google Cloud and minikube), or
a volume capable of being mounted multiple times (such as an NFS volume). Note, however, that in such a case all your pods read/write to the same folder, and this can lead to issues when they are not meant to lock/write to the same data concurrently. The usual use case for this is an upload folder that pods save to and that is later used only for reading, and similar cases. SQL databases (such as MySQL), on the other hand, are not meant to write to such a shared folder.
Instead of either of the mentioned requirements, your claim manifest uses hostPath (pointing to /nfs) set to ReadWriteOnce (only one pod can use it). You are also using 'standard' as the storage class, while the URL you gave only has fast and slow ones, so you probably created that storage class as well.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container
That is because the first pod already took its claim (ReadWriteOnce, hostPath), and the second pod can't reuse the same one unless a proper provisioner or access mode is set up.
If it was successful why does the pod think it failed?
All PVCs were successfully bound to their accompanying PVs. But you are never binding the second and third PVCs to the second and third pods. You are retrying with the first claim on the second pod, and the first claim is already bound (to the first pod) in ReadWriteOnce mode, so it can't also be bound to the second pod, and you get the error...
Suggested approach
Since you reference /nfs as your host path, it may be safe to assume that you are using some kind of NFS-backed file system. Here is an alternative setup that lets you mount dynamically provisioned persistent volumes over NFS to as many pods in a stateful set as you want.
Notes:
This only answers the original question of mounting persistent volumes across stateful set replicated pods, under the assumption of NFS sharing.
NFS is not really advisable for dynamic data such as a database. The usual use case is an upload folder or a moderate logging/backup folder. A database (SQL or NoSQL) is usually a no-no for NFS.
For mission/time-critical applications you might want to stress-test carefully before taking this approach to production, since both k8s and the external PV add some layers/latency in between. For some applications this might suffice, but be warned.
You have limited control over the names of dynamically created PVs (k8s adds a suffix to newly created ones, and reuses available old ones if told to do so), but k8s will keep them after a pod is terminated and assign the first available one to a new pod, so you won't lose state/data. This is something you can control with policies, though.
Steps:
For this to work you will first need to install the NFS provisioner from here:
https://github.com/kubernetes-incubator/external-storage/tree/master/nfs. Mind you, the installation is not complicated, but it has some steps where you have to take a careful approach (permissions, setting up NFS shares, etc.), so it is not just a fire-and-forget deployment. Take your time installing the NFS provisioner correctly. Once it is properly set up, you can continue with the suggested manifests below:
Storage class manifest:
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: sc-nfs-persistent-volume
# if you changed this during provisioner installation, update it here as well
provisioner: example.com/nfs
Stateful Set (important excerpt only):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ss-my-app
spec:
  replicas: 3
  ...
  selector:
    matchLabels:
      app: my-app
      tier: my-mongo-db
  ...
  template:
    metadata:
      labels:
        app: my-app
        tier: my-mongo-db
    spec:
      ...
      containers:
        - image: ...
          ...
          volumeMounts:
            - name: persistent-storage-mount
              mountPath: /wherever/on/container/you/want/it/mounted
          ...
      ...
  volumeClaimTemplates:
    - metadata:
        name: persistent-storage-mount
      spec:
        storageClassName: sc-nfs-persistent-volume
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
  ...
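Once this is applied, each replica gets its own dynamically provisioned claim; StatefulSet claims follow the <template-name>-<statefulset-name>-<ordinal> naming pattern, so you can check them with:
kubectl get pvc
# persistent-storage-mount-ss-my-app-0   Bound   ...
# persistent-storage-mount-ss-my-app-1   Bound   ...
# persistent-storage-mount-ss-my-app-2   Bound   ...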