In our cluster we have a horizontally-scaling deployment of an application that uses a lot of local disk space, which has been causing major cluster stability problems (docker crashes, nodes recreate, etc).
We are trying to have each pod provision a gcePersistentDisk of its own so its disk usage is isolated from the cluster. We created a storage class and a persistent volume claim that uses that class, and have specified a volume mount for that claim in our deployment's pod template spec.
However, when we set the autoscaler to use multiple replicas, they apparently try to use the same volume, and we get this error:
Multi-Attach error for volume
Volume is already exclusively attached to one node and can't be attached to another
Here are the relevant parts of our manifests. Storage Class:
{
"apiVersion": "storage.k8s.io/v1",
"kind": "StorageClass",
"metadata": {
"annotations": {},
"name": "some-storage",
"namespace": ""
},
"parameters": {
"type": "pd-standard"
},
"provisioner": "kubernetes.io/gce-pd"
}
PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: some-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
storageClassName: some-class
Deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: some-deployment
spec:
volumes:
- name: some-storage
persistentVolumeClaim:
claimName: some-pvc
containers:
[omitted]
volumeMounts:
- name: some-storage
mountPath: /var/path
With those applied, we update the deployment's autoscaler to a minimum of 2 replicas and get the above error.
Is this not how persistent volume claims should work?
We definitely don't care about volume sharing, and we don't really care about persistence, we just want storage that is isolated from the cluster -- is this the right tool for the job?
A Deployment is meant to be stateless. There is no way for the deployment controller to determine which disk belongs to which pod once a pod gets rescheduled, which would lead to corrupted state. That is the reason why a Deployment can only have one disk shared across all its pods.
Concerning the error you are seeing:
Multi-Attach error for volume
Volume is already exclusively attached to one node and can't be attached to another
You are getting this because you have pods across multiple nodes, but only one volume (because a Deployment can only have one) and multiple nodes are trying to mount this volume to attach it to your deployments pods. The volume doesn't seem to be NFS which could be mounted into multiple nodes at the same time. If you do not care about state at all and still want to use a Deployment, then you must use a disk that supports mounts from multiple nodes at the same time, like NFS. Further, you would need to change your PVCs accessModes policy to ReadWriteMany, as multiple pods would write to the same physical volume.
If you need a dedicated disk for each pod, then you might want to use a StatefulSet instead. As the name suggests, its pods are meant to keep state, thus you can also define a volumeClaimTemplates section in it, which will create a dedicated disk for each pod as described in the documentation.
Related
Im new to kubernetes and this topic is confusing for me. I've learned that stateful set doesn't share the PV and each replica has it's own PV. On the other hand I saw the examples when one was using one pvc in stateful set with many replicas. So my question is what will happen then? As PVC to PV are bind 1:1 so one pvc can only bind to one pv, but each replica should have its own PV so how is it possible to have one pvc in stateful set in this scenario?
You should usually use a volume claim template with a StatefulSet. As you note in the question, this will create a new PersistentVolumeClaim (and a new PersistentVolume) for each replica. Data is not shared, except to the extent the container process knows how to replicate data between its replicas. If a StatefulSet Pod is deleted and recreated, it will come back with the same underlying PVC and the same data, even if it is recreated on a different Node.
spec:
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 1Gi
template:
spec:
containers:
- name: name
volumeMounts:
- name: data
mountPath: /data
You're allowed to manually create a PVC and attach it to the StatefulSet Pods
# not recommended -- one PVC shared across all replicas
spec:
template:
spec:
volumes:
- name: data
persistentVolumeClaim:
claimName: manually-created-pvc
containers:
- name: name
volumeMounts:
- name: data
mountPath: /data
but in this case the single PVC/PV will be shared across all of the replicas. This often doesn't work well: things like database containers have explicit checks that their storage isn't shared, and there is a range of concurrency problems that are possible doing this. This also can prevent pods from starting up since the volume types that are straightforward to get generally only support a ReadWriteOnce access mode; to get ReadWriteMany you need to additionally configure something like an NFS server outside the cluster.
i am not sure which example you were following and checked that scenario however yes PV and PVC is 1:1 mapping.
Usually, PVC gets attached to POD with access mode ReadWriteOnly which mean only one pod can do ReadWrite.
The scenario that you have might be seen could be something like there is a single PVC and single PV attach to multiple replicas which could be due to ReadWriteMany.
A PersistentVolumeClaim (PVC) is a request for storage by a user. It
is similar to a Pod. Pods consume node resources and PVCs consume PV
resources. Pods can request specific levels of resources (CPU and
Memory). Claims can request specific size and access modes (e.g., they
can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see
AccessModes).
Read more about access mode here : https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
NFS, EFS and other type of storage support the ReadWriteMany access mode.
When I deploy e.g. nginx as SS and I use one PVC only one PV is created and storage is shared between all replicas.
You experiment is correct, this is possible because the scheduler has assigned all of the pods on the same node due to the dependency to the PV. If the node runs out of resources and result to a pod gets schedule on another node, that pod will enter pending state.
I have generated logs for my pods using kubectl logs 'pod name. But I want to persist these logs in a volume (some kind of persistent storage), because container logs will get wiped out if the pods go down. Is there a way to do this? Do I have to write some sort of a script?
I have read many answers but I still do not understand how to go about it, any help is appreciated. Thanks!
Under Logging Architecture Kubernetes documents goes thru couple of way to set up loggin in your cluster.
The most interesting for you might be Cluster-level logging architecture:
While Kubernetes does not provide a native solution for cluster-level
logging, there are several common approaches you can consider. Here
are some options:
Use a node-level logging agent that runs on every node.
Include a dedicated sidecar container for logging in an application pod.
Push logs directly to a backend from within an application
There are many solutions for collecting pod logs and shipping them to a centralized location such as:
fluentd
splunk
elastic
Keeping logs outside of cluster has benefits. If you cluster begins to have issues its more likely that your inside logging architecure will also face them.
You will need to mount the logs directory inside the container to the host machine as well, using the PersistentVolume and PersistentVolumeClaim.
This way you can persist these logs even if the container is killed.
Create the PersistentVolume and PersistentVolumeClaim for the log path and use them as volume mounts to the kubernetes deployments or pods.
I know this is an old question, but I've just had the same problem and I've spent some time to figure out the solution, so I'd like to share a more detailed solution.
Like Aayush Mall said, you'll need the PersistentVolume and PersistentVolumeClaim objects to create the volume and then link it to the pod (preferably via a Deployment object).
Basically, The PersistentVolume would define how and where the volume would be stored in the host and the PersistentVolumeClaim would define the constraints to bind the volume to some container.
From the docs:
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system.
A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see AccessModes).
So, let's say your pods are running in two nodes: mynode-1 and mynode-2.
Your PersistentVolume spec will look like this.
apiVersion: v1
kind: PersistentVolume
metadata:
name: myapp-log-pv
spec:
capacity:
storage: 10Gi
volumeMode: Filesystem
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: local-storage
local:
path: /var/log/myapp
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- mynode-1
- mynode-2
Your PersistentVolumeClaim like this.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: myapp-log-pvc
spec:
volumeMode: Filesystem
accessModes:
- ReadWriteMany
storageClassName: local-storage
resources:
requests:
storage: 2Gi
volumeName: myapp-log
And then, you just have to tell the deployment object how to mount the volume inside the container. So, your Deployment spec will look like this.
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-deploy
spec:
selector:
matchLabels:
app: myapp
replicas: 1
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: myapp
image: myrepo/myapp:latest
volumeMounts:
- name: log
mountPath: /var/log
volumes:
- name: log
persistentVolumeClaim:
claimName: myapp-log-pvc
And that's it. When your deployment starts, it'll create the pod with the container, mount a volume named log for the path /var/log (inside the container) and bound this volume to some PV matching the requirements of the PVC named myapp-log-pvc. As we've created the myapp-log-pv with the same volumeMode, accessModes and storageClassName fields and with more storage capacity then the required by myapp-log-pvc, they will be bound. So, your app logs will be stored in the path /var/log/myapp (field spec.local.path in the myapp-log-pv spec) inside the node running the pod.
I hope it help :)
Also, I'm kinda new in the kubernetes world, so please let me know if you notice I misunderstood something or if there is a better way to do this.
I have deploy kubernetes cluster with stateful pod for mysql. for each pod i have different pvc.
for example : if 3 pod thn 3 5GB EBS PVC
SO Which way is better using one PVC for all pods or use different pvc for each pod.
StatefulSet must use volumeClaimTemplates if you want to have dedicated storage for each pod of a set. Based on that template PersistentVolumeClaim for each pod is created and configured the volume to be bound to that claim. The generated PersistentVolumeClaims names consists of volumeClaimTemplate name + pod-name + ordinal number.
So if you add volumeClaimTemplate part to your StatefulSet YAML(and delete specific persistentVolumeClaim references), smth like that:
volumeClaimTemplates:
- metadata:
name: mysql-data
spec:
resources:
requests:
storage: 10Gi
accessModes:
- ReadWriteOnce
Then go and create your StatefulSet and after to examine one of its pods (kubectl get pods pod-name-0 yaml) you’ll see smth like that(volumes part of the output):
volumes:
- name: mysql-data
persistentVolumeClaim:
claimName: mysql-data-pod-name-0. | dynamically created claim based on the template
So by using volumeClaimTemplates you don’t need to create a separate PVCs yourself and then in each separate StatefulSet reference that PVC to be mounted in your container at a specific mountPath(remember that each pod of a set must reference a different PVC, 1PVC-1PV) :
Part of “containers” definition of your Statefulset YAML:
volumeMounts:
- name: mysql-data || references your PVC by -name(not PVC name itself)
mountPath: /var/lib/mysql
So to aim for each pod of the set to have dedicated storage and not use volumeClaimTemplates is leading to a lot of problems and over complications to manage and scale it.
A PVC gets bound to a specific PV. For a StatefulSet, in most imaginable cases, you want to have a PV that can be accessed only by certain pod, so that data is not corrupted by write attempt from a parallel process/pod (RWO rather then RWX mode).
With that in mind, you need a PVC per replica in StatefulSet. Creating PVCs for replicas would get problematic very quickly if done manualy, this is why the right way to do it is to use volumeClaimTemplates that will dynamicaly create PVCs for you as you scale your set.
I have a docker image that when created should check if the volume is empty, in case it should initialize it with some data.
This saved data must remain available for other pods with the same or different image.
What do you recommend me to do?
You have 2 options:
First option is to mount the pod into the node and save the data in the node so when new pod will create in the same node it will have an access to the same volume (persistent storage location).
Potential problem: 2 pods on the same node can create deadlock for the same resource (so you have to manage the resource).
Shared storage meaning create one storage and every pod will claim storage in the same storage.
I strongly suggest that you will take the next 55 minutes and see the webinar below:
https://www.youtube.com/watch?v=n06kKYS6LZE
I assume you create your pods using Deployment object in Kubernetes. What you want to look into is a StatefulSet, which, in opposite to deployments, retains some identity aspects for recreated pods including to some extent network and storage.
It was introduced specifically as a means to run services that need to keep their state in kube cluster (ie. running databases queues etc.)
Looking at the answers, would it not be simpler to create an NFS Persistent Volume and then allow the pods to mount the PV's?
You can use the writemany which should alleviate a deadlock.
apiVersion: v1
kind: PersistentVolume
metadata:
name: shared-volume
spec:
capacity:
storage: 1Gi
volumeMode: Filesystem
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: ""
mountOptions:
- hard
- nfsvers=4.1
nfs:
path: /tmp
server: 172.17.0.2
Persistent Volumes
I'm in the process of creating a StatefulSet based on this yaml, that will have 3 replicas. I want each of the 3 pods to connect to a different PersistentVolume.
For the persistent volume I'm using 3 objects that look like this, with only the name changed (pvvolume, pvvolume2, pvvolume3):
kind: PersistentVolume
apiVersion: v1
metadata:
name: pvvolume
labels:
type: local
spec:
storageClassName: standard
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/nfs"
claimRef:
kind: PersistentVolumeClaim
namespace: default
name: mongo-persistent-storage-mongo-0
The first of the 3 pods in the StatefulSet seems to be created without issue.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container.
Yet if I go to the tab showing PersistentVolumeClaims the second one that was created seems to have been successful.
If it was successful why does the pod think it failed?
I want each of the 3 pods to connect to a different PersistentVolume.
For that to work properly you will either need:
provisioner (in link you posted there are example how to set provisioner on aws, azure, googlecloud and minicube) or
volume capable of being mounted multiple times (such as nfs volume). Note however that in such a case all your pods read/write to the same folder and this can lead to issues when they are not meant to lock/write to same data concurrently. Usual use case for this is upload folder that pods are saving to, that is later used for reading only and such use cases. SQL Databases (such as mysql) on the other hand, are not meant to write to such shared folder.
Instead of either of mentioned requirements in your claim manifest you are using hostPath (pointing to /nfs) and set it to ReadWriteOnce (only one can use it). You are also using 'standard' as storage class and in url you gave there are fast and slow ones, so you probably created your storage class as well.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container
That is because first pod already took it's claim (read write once, host path) and second pod can't reuse same one if proper provisioner or access is not set up.
If it was successful why does the pod think it failed?
All PVC were successfully bound to accompanying PV. But you are never bounding second and third PVC to second or third pods. You are retrying with first claim on second pod, and first claim is already bound (to fist pod) in ReadWriteOnce mode and can't be bound to second pod as well and you are getting error...
Suggested approach
Since you reference /nfs as your host path, it may be safe to assume that you are using some kind of NFS-backed file system so here is one alternative setup that can get you to mount dynamically provisioned persistent volumes over nfs to as many pods in stateful set as you want
Notes:
This only answers original question of mounting persistent volumes across stateful set replicated pods with the assumption of nfs sharing.
NFS is not really advisable for dynamic data such as database. Usual use case is upload folder or moderate logging/backing up folder. Database (sql or no sql) is usually a no-no for nfs.
For mission/time critical applications you might want to time/stresstest carefully prior to taking this approach in production since both k8s and external pv are adding some layers/latency in-between. Although for some application this might suffice, be warned about it.
You have limited control of name for pv that are being dynamically created (k8s adds suffix to newly created, and reuses available old ones if told to do so), but k8s will keep them after pod get terminated and assign first available to new pod so you won't loose state/data. This is something you can control with policies though.
Steps:
for this to work you will first need to install nfs provisioner from here:
https://github.com/kubernetes-incubator/external-storage/tree/master/nfs. Mind you that installation is not complicated but has some steps where you have to take careful approach (permissions, setting up nfs shares etc) so it is not just fire-and-forget deployment. Take your time installing nfs provisioner correctly. Once this is properly set up you can continue with suggested manifests below:
Storage class manifest:
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
name: sc-nfs-persistent-volume
# if you changed this during provisioner installation, update also here
provisioner: example.com/nfs
Stateful Set (important excerpt only):
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: ss-my-app
spec:
replicas: 3
...
selector:
matchLabels:
app: my-app
tier: my-mongo-db
...
template:
metadata:
labels:
app: my-app
tier: my-mongo-db
spec:
...
containers:
- image: ...
...
volumeMounts:
- name: persistent-storage-mount
mountPath: /wherever/on/container/you/want/it/mounted
...
...
volumeClaimTemplates:
- metadata:
name: persistent-storage-mount
spec:
storageClassName: sc-nfs-persistent-volume
accessModes: [ ReadWriteOnce ]
resources:
requests:
storage: 10Gi
...