Can I rely on volumeClaimTemplates naming convention? - kubernetes

I want to setup a pre-defined PostgreSQL cluster in a bare meta kubernetes 1.7 with local PV enable. I have three work nodes. I create local PV on each node and deploy the stateful set successfully (with some complex script to setup Postgres replication).
However I'm noticed that there's a kind of naming convention between the volumeClaimTemplates and PersistentVolumeClaim.
For example
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: postgres
volumeClaimTemplates:
- metadata:
name: pgvolume
The created pvc are pgvolume-postgres-0, pgvolume-postgres-1, pgvolume-postgres-2 .
With some tricky, I manually create PVC and bind to the target PV by selector. I test the stateful set again. It seems the stateful set is very happy to use these PVC.
I finish my test successfully but I still have this question. Can I rely on volumeClaimTemplates naming convention? Is this an undocumented feature?

Based on the statefulset API reference
volumeClaimTemplates is a list of claims that pods are allowed to reference. The StatefulSet controller is responsible for mapping network identities to claims in a way that maintains the identity of a pod. Every claim in this list must have at least one matching (by name) volumeMount in one container in the template. A claim in this list takes precedence over any volumes in the template, with the same name.
So I guess you can rely on it.
Moreover, you can define a storage class to leverage dynamic provisioning of persistent volumes, so you won't have to create them manually.
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: my-storage-class
resources:
requests:
storage: 1Gi
Please refer to Dynamic Provisioning and Storage Classes in Kubernetes for more details.

Related

What is the PersistentVolumeClaim policy for local PersistentVolume in Kubernetes?

Scenario 1:
I have 3 local-persistent-volumes provisioned, each pv is mounted on different node:
10.30.18.10
10.30.18.11
10.30.18.12
When I start my app with 3 replicas using:
kind: StatefulSet
metadata:
name: my-db
spec:
replicas: 3
...
...
volumeClaimTemplates:
- metadata:
name: my-local-vol
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-local-sc"
resources:
requests:
storage: 10Gi
Then I notice pods and pvs are on the same host:
pod1 with ip 10.30.18.10 has claimed the pv that is mounted on 10.30.18.10
pod2 with ip 10.30.18.11 has claimed the pv that is mounted on 10.30.18.11
pod3 with ip 10.30.18.12 has claimed the pv that is mounted on 10.30.18.12
(whats not happening is: pod1 with ip 10.30.18.10 has claimed the pv that is mounted on different node 10.30.18.12 etc)
The only common config between pv and pvc is storageClassName, so I didn't configure this behavior.
Question:
So, who is responsible for this magic? Kubernetes scheduler? Kubernetes provisioner?
Scenario 2:
I have 3 local-persistent-volumes provisioned:
pv1 has capacity.storage of 10Gi
pv2 has capacity.storage of 100Gi
pv3 has capacity.storage of 100Gi
Now, I start my app with 1 replica
kind: StatefulSet
metadata:
name: my-db
spec:
replicas: 1
...
...
volumeClaimTemplates:
- metadata:
name: my-local-vol
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-local-sc"
resources:
requests:
storage: 10Gi
I want to ensure that this StatefulSet always claim pv1 (10Gi) even this is on a different node, and don't claim pv2 (100Gi) and pv3 (100Gi)
Question:
Does this happen automatically?
How do I ensure the desired behavior? Should I use a separate storageClassName to ensure this?
What is the PersistentVolumeClaim policy? Where can I find more info?
EDIT:
yml used for StorageClass:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: my-local-pv
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
With local Persistent Volumes, this is the expected behaviour. Let me try to explain what happens when using local storage.
The usual setup for local storage on a cluster is the following:
A local storage class, configured to be WaitForFirstConsumer
A series of local persistent volumes, linked to the local storage class
And this is all well documented with examples in the official documentation: https://kubernetes.io/docs/concepts/storage/volumes/#local
With this done, Persistent Volume Claims can request storage from the local storage class and StatefulSets can have a volumeClaimTemplate which requests storage of the local storage class.
Let me take as example your StatefulSet with 3 replicas, each one requires local storage with the volumeClaimTemplate.
When the Pods are first created, they request a storage of the required storageClass. For example your my-local-sc
Since this storage class is manually created and does not support dynamically provisioning of new PVs (like, for example, Ceph or similar) it is checked if a PV attached to the storage class is available to be bound.
If a PV is selected, it is bound to the newly created PVC (and from now, can be used only with that particular PV, since it is now Bound)
Since the PV is of type local, the PV has a nodeAffinity required which selects a node.
This force the Pod, now bound to that PV, to be scheduled only on that particular node.
This is why each Pod was scheduled on the same node of the bounded persistent volume. And this means that the Pod is restricted to run on that node only.
You can test this easily by draining / cordoning one of the nodes and then trying to restart the Pod bound to the PV available on that particular node. What you should see is that the Pod will not start, as the PV is restricted from its nodeAffinity and the node is not available.
Once each Pod of the StatefulSet is bound to a PV, that Pod will be scheduled only on a specific node.. Pods will not change the PV that they are using, unless the PVC is removed (which will force the Pod to request again a new PV to bound)
Since local storage is handled manually, PV which were bounded and have the related PVC removed from the cluster, enter in Released state and cannot be claimed anymore, they must be handled by someone.. maybe deleting them and then recreating new ones at the same location (and maybe cleaning the filesystem as well, depending on the situation)
This means that local storage is OK to be used only:
If HA is not a problem.. for example, I don't care if my app is blocked by a single node not working
If HA is handled directly by the app itself. For example, a StatefulSet with 3 Pods like a multi-primary database (Galera, Clickhouse, Percona for examples) or ElasticSearch or Kafka, Zookeeper or something like that.. all will handle the HA on their own as they can resist one of their nodes being down as long as there's quorum.
UPDATE
Regarding the Scenario 2 of your question. Let's say you have multiple Available PVs and a single Pod which starts and wants to Bound to one of them. This is a normal behaviour and the control plane would select one of those PVs on its own (if they match with the requests in Claim)
There's a specific way to pre-bind a PV and a PVC, so that they will always bind together. This is described in the docs as "reserving a PV": https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reserving-a-persistentvolume
But the problem is that this cannot be applied to olume claim templates, as it requires the claim to be created manually with special properties.
The volume claim template tho, as a selector field which can be used to restrict the selection of a PV based on labels. It can be seen in the API specs ( https://v1-18.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#persistentvolumeclaimspec-v1-core )
When you create a PV, you label it with what you want.. for example you could label it like the following:
apiVersion: v1
kind: PersistentVolume
metadata:
name: example-small-pv
labels:
size-category: small
spec:
capacity:
storage: 10Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-storage
local:
path: /mnt/disks/ssd1
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- example-node-1
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: example-big-pv
labels:
size-category: big
spec:
capacity:
storage: 100Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-storage
local:
path: /mnt/disks/ssd1
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- example-node-2
And then the claim template can select a category of volumes based on the label. Or maybe it doesn't care so it doesn't specify selector and can use all of them (provided that the size is enough for its claim request)
This could be useful.. but it's not the only way to select or restrict which PVs can be selected, because when the PV is first bound, if the storage class is WaitForFirstConsumer, the following is also applied:
Delaying volume binding ensures that the PersistentVolumeClaim binding
decision will also be evaluated with any other node constraints the
Pod may have, such as node resource requirements, node selectors, Pod
affinity, and Pod anti-affinity.
Which means that if the Pod has a node affinity to one of the nodes, it will select for sure a PV on that node (if the local storage class used is WaitForFirstConsumer)
Last, let me quote the offical documentation for things that I think they could answer your questions:
From https://kubernetes.io/docs/concepts/storage/persistent-volumes/
A user creates, or in the case of dynamic provisioning, has already
created, a PersistentVolumeClaim with a specific amount of storage
requested and with certain access modes. A control loop in the master
watches for new PVCs, finds a matching PV (if possible), and binds
them together. If a PV was dynamically provisioned for a new PVC, the
loop will always bind that PV to the PVC. Otherwise, the user will
always get at least what they asked for, but the volume may be in
excess of what was requested. Once bound, PersistentVolumeClaim binds
are exclusive, regardless of how they were bound. A PVC to PV binding
is a one-to-one mapping, using a ClaimRef which is a bi-directional
binding between the PersistentVolume and the PersistentVolumeClaim.
Claims will remain unbound indefinitely if a matching volume does not
exist. Claims will be bound as matching volumes become available. For
example, a cluster provisioned with many 50Gi PVs would not match a
PVC requesting 100Gi. The PVC can be bound when a 100Gi PV is added to
the cluster.
From https://kubernetes.io/docs/concepts/storage/volumes/#local
Compared to hostPath volumes, local volumes are used in a durable and
portable manner without manually scheduling pods to nodes. The system
is aware of the volume's node constraints by looking at the node
affinity on the PersistentVolume.
However, local volumes are subject to the availability of the
underlying node and are not suitable for all applications. If a node
becomes unhealthy, then the local volume becomes inaccessible by the
pod. The pod using this volume is unable to run. Applications using
local volumes must be able to tolerate this reduced availability, as
well as potential data loss, depending on the durability
characteristics of the underlying disk.

Kubernetes: Using an ordinal number in a claimName?

I have a statefulset that is running great and the stateful set has ReadWriteMany PVC. I need to share this PVC with another statefulset.
Does anybody know how I can add the ordinal number into the claimName.
Basically I have a backendService that is a statefulset with 2 replicas so it has a volumeClaimTemplate defined - hence it has 2 volumes service-data-service-0 and service-data-service-1 for example.
In the other statefulset - it has its own data volume but I need to share the data volume from the other statefulset.
There is a one to one mapping - meaning that the volume with ordinal 0 in the lower service needs to be added to pod0 and the same for volume with ordinal 1 to pod1.
I am little confused how I am able to do this. Its easy with a deployment, because technically you have 2 x deployments.. SO each deployment can be strictly sent to the correct service-data-service- XX (Where XX is the ordinal number of the lower server i.e 0,1 etc)
In my head, psuedo code - I have this. Can anyone help ?
volumes:
- name: lnd2-data-volume
persistentVolumeClaim:
# This volumes section is in the higher service but shares a data volume
# with the lower service
claimName: service-data-service-{{ "SOME TEMPLATE HERE to give me either 0 or 1 for the current POD ordinal number }}
Any ideas ?
To see TLDR version please go to the solution below.
What you are trying to achieve is not doable in Statefulset (STS) today.
Claims due to the design of StatefulSet controller need to have a unique identifiers, in order to be mapped to their corresponding pods, and cannot be reused between different StatefulSet applications.
So no matter, what claim name you specify within Volumes as part of Pod's template inside StatefulSet definition (e.g. claimName=service-data-service-0), it will be always overwritten by StatefulSet controller for each controlled by it Pod using the following naming scheme:
PVC name = claim.Name + set.Name + ordinal number
where:
claim.Name - claim on the list of STS's volumeClaimTemplates matching 'volumeMount' in PodTemplateSpec
set.Name - StatefulSet name
ordinal - Pod's (replicas - 1)
My observations:
The existing PVC (of ReadWriteMany mode) can be used by STS, only when you introduce the StatefulSet for the first time in your cluster (=it's not owned yet by other workload).
For example, the STS like this one:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: peb
spec:
...
volumeClaimTemplates:
- metadata:
name: fileserver-claim
spec:
accessModes: [ "ReadWriteMany" ]
storageClassName: ""
resources:
requests:
storage: 1Gi
would consume the existing PVC:
fileserver-claim-peb-0
with accompanying event seen in API server logs:
The PVC 'fileserver-claim-peb-0' already exists
and because there cannot be any different STS of the same name (Pod 'peb-0' is unique in the cluster likewise its claimName), your options are over here.
Solution:
Pre-provision manually couple of PVs, that use the same associated storage asset (e.g NFS based volumes supporting RWX access mode) and inside your STS on the list of PVCs reference by name (volumeName) the existing unbound PV, e.g:
...
volumeClaimTemplates:
- metadata:
name: datadir
spec:
accessModes:
- "ReadWriteOnce"
volumeName: fileserver-claim-peb
resources:
requests:
storage: 1Gi
I think this is a recipe to share the same data storage between different StatefulSet(s).

Is there any technical problem if I have one PVC to share the same volume across all statefulset replicas?

Kubernetes creates one PersistentVolume for each VolumeClaimTemplate definition on an statefulset. That makes each statefulset pod have its own storage that is not shared across the replicas. However, I would like to share the same volume across all the statefulset replicas.
It looks like the approach should be the following:
Create a PVC on the same namespace.
On the statefulset use Volumes to bound the PVC
Ensure that the PVC is ReadOnlyMany or ReadWriteMany
Assuming that my application is able to deal with any concurrency on the shared volume, is there any technical problem if I have one PVC to share the same volume across all statefulset replicas?
I wholeheartedly agree with the comments made by #Jonas and #David Maze:
You can do this, it should work. There is no need to use volumeClaimTemplates unless your app needs it.
Two obvious problems are that ReadWriteMany volumes are actually a little tricky to get (things like AWS EBS volumes are only ReadWriteOnce), and that many things you want to run in StatefulSets (like databases) want exclusive use of their filesystem space and use file locking to enforce this.
Answering on the question:
Is there any technical problem if I have one PVC to share the same volume across all statefulset replicas?
I'd say that this would mostly depend on the:
How the application would handle such scenario where it's having single PVC (writing concurrency).
Which storage solution are supported by your Kubernetes cluster (or could be implemented).
Subjectively speaking, I don't think there should be an issue when above points are acknowledged and aligned with the requirements and configuration that the cluster/applications allows.
From the application perspective, there is an inherent lack of the software we are talking about. Each application could behave differently and could require different tuning (look on the David Maze comment).
We do not also know anything about your infrastructure so it could be hard to point you potential issues. From the hardware perspective (Kubernetes cluster), this would inherently go into making a research on the particular storage solution that you would like to use. It could be different from cloud provider to cloud provider as well as on-premise solutions. You would need to check the requirements of your app to align it to the options you have.
Continuing on the matter of Volumes, I'd reckon the one of the important things would be accessModes.
Citing the official docs:
Access Modes
A PersistentVolume can be mounted on a host in any way supported by the resource provider. As shown in the table below, providers will have different capabilities and each PV's access modes are set to the specific modes supported by that particular volume. For example, NFS can support multiple read/write clients, but a specific NFS PV might be exported on the server as read-only. Each PV gets its own set of access modes describing that specific PV's capabilities.
The access modes are:
ReadWriteOnce -- the volume can be mounted as read-write by a single node
ReadOnlyMany -- the volume can be mounted read-only by many nodes
ReadWriteMany -- the volume can be mounted as read-write by many nodes
In the CLI, the access modes are abbreviated to:
RWO - ReadWriteOnce
ROX - ReadOnlyMany
RWX - ReadWriteMany
Kubernetes.io: Docs: Concepts: Storage: Persistent Volumes: Access modes
One of the issues you can run into is with the ReadWriteOnce when the PVC is mounted to the Node and sts-X (Pod) is scheduled onto a different Node but from the question, I'd reckon you already know about it.
However, I would like to share the same volume across all the statefulset replicas.
An example of a StatefulSet with a Volume that would be shared across all of the replicas could be following (modified example from Kubernetes documentation):
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
selector:
matchLabels:
app: nginx # has to match .spec.template.metadata.labels
serviceName: "nginx"
replicas: 3 # by default is 1
template:
metadata:
labels:
app: nginx # has to match .spec.selector.matchLabels
spec:
terminationGracePeriodSeconds: 10
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
name: web
# VOLUME START
volumeMounts:
- name: example-pvc
mountPath: /usr/share/nginx/html
volumes:
- name: example-pvc
persistentVolumeClaim:
claimName: pvc-for-sts
# VOLUME END
Additional resources:
Kubernetes.io: Docs: Concepts: Workloads: Controllers: Statefulset
Kubernetes.io: Docs: Concepts: Storage: Persistent Volumes

Kubernetes trouble with StatefulSet and 3 PersistentVolumes

I'm in the process of creating a StatefulSet based on this yaml, that will have 3 replicas. I want each of the 3 pods to connect to a different PersistentVolume.
For the persistent volume I'm using 3 objects that look like this, with only the name changed (pvvolume, pvvolume2, pvvolume3):
kind: PersistentVolume
apiVersion: v1
metadata:
name: pvvolume
labels:
type: local
spec:
storageClassName: standard
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/nfs"
claimRef:
kind: PersistentVolumeClaim
namespace: default
name: mongo-persistent-storage-mongo-0
The first of the 3 pods in the StatefulSet seems to be created without issue.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container.
Yet if I go to the tab showing PersistentVolumeClaims the second one that was created seems to have been successful.
If it was successful why does the pod think it failed?
I want each of the 3 pods to connect to a different PersistentVolume.
For that to work properly you will either need:
provisioner (in link you posted there are example how to set provisioner on aws, azure, googlecloud and minicube) or
volume capable of being mounted multiple times (such as nfs volume). Note however that in such a case all your pods read/write to the same folder and this can lead to issues when they are not meant to lock/write to same data concurrently. Usual use case for this is upload folder that pods are saving to, that is later used for reading only and such use cases. SQL Databases (such as mysql) on the other hand, are not meant to write to such shared folder.
Instead of either of mentioned requirements in your claim manifest you are using hostPath (pointing to /nfs) and set it to ReadWriteOnce (only one can use it). You are also using 'standard' as storage class and in url you gave there are fast and slow ones, so you probably created your storage class as well.
The second fails with the error pod has unbound PersistentVolumeClaims
Back-off restarting failed container
That is because first pod already took it's claim (read write once, host path) and second pod can't reuse same one if proper provisioner or access is not set up.
If it was successful why does the pod think it failed?
All PVC were successfully bound to accompanying PV. But you are never bounding second and third PVC to second or third pods. You are retrying with first claim on second pod, and first claim is already bound (to fist pod) in ReadWriteOnce mode and can't be bound to second pod as well and you are getting error...
Suggested approach
Since you reference /nfs as your host path, it may be safe to assume that you are using some kind of NFS-backed file system so here is one alternative setup that can get you to mount dynamically provisioned persistent volumes over nfs to as many pods in stateful set as you want
Notes:
This only answers original question of mounting persistent volumes across stateful set replicated pods with the assumption of nfs sharing.
NFS is not really advisable for dynamic data such as database. Usual use case is upload folder or moderate logging/backing up folder. Database (sql or no sql) is usually a no-no for nfs.
For mission/time critical applications you might want to time/stresstest carefully prior to taking this approach in production since both k8s and external pv are adding some layers/latency in-between. Although for some application this might suffice, be warned about it.
You have limited control of name for pv that are being dynamically created (k8s adds suffix to newly created, and reuses available old ones if told to do so), but k8s will keep them after pod get terminated and assign first available to new pod so you won't loose state/data. This is something you can control with policies though.
Steps:
for this to work you will first need to install nfs provisioner from here:
https://github.com/kubernetes-incubator/external-storage/tree/master/nfs. Mind you that installation is not complicated but has some steps where you have to take careful approach (permissions, setting up nfs shares etc) so it is not just fire-and-forget deployment. Take your time installing nfs provisioner correctly. Once this is properly set up you can continue with suggested manifests below:
Storage class manifest:
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
name: sc-nfs-persistent-volume
# if you changed this during provisioner installation, update also here
provisioner: example.com/nfs
Stateful Set (important excerpt only):
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: ss-my-app
spec:
replicas: 3
...
selector:
matchLabels:
app: my-app
tier: my-mongo-db
...
template:
metadata:
labels:
app: my-app
tier: my-mongo-db
spec:
...
containers:
- image: ...
...
volumeMounts:
- name: persistent-storage-mount
mountPath: /wherever/on/container/you/want/it/mounted
...
...
volumeClaimTemplates:
- metadata:
name: persistent-storage-mount
spec:
storageClassName: sc-nfs-persistent-volume
accessModes: [ ReadWriteOnce ]
resources:
requests:
storage: 10Gi
...

How to dynamically create EBS volume with Kubernetes Persistence Volume

I'm aware you can use aws cli to create ebs volume and then get the Volume ID and add to PersistentVolume config as below under the volumeID.
I don't want to use aws cli to create the ebs volume, My question is, how do I use Kubernetes to create this ebs volume dynamically without using the cli ?
apiVersion: "v1"
kind: "PersistentVolume"
metadata:
name: "pv0001"
spec:
capacity:
storage: "5Gi"
accessModes:
- "ReadWriteOnce"
awsElasticBlockStore:
fsType: "ext4"
volumeID: "volume-ID"
By default this should be working on a decently provisioned cluster. Just have the storageClassName defined correctly on a matching PVC and a PV will et provisioned for it (no need to precreate PV object, just the claim)
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims
Dynamic provisioning
When none of the static PVs the administrator created matches a user’s PersistentVolumeClaim, the cluster may try to dynamically provision a volume specially for the PVC. This provisioning is based on StorageClasses: the PVC must request a class and the administrator must have created and configured that class in order for dynamic provisioning to occur. Claims that request the class "" effectively disable dynamic provisioning for themselves
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#provisioning
Follow this:
https://docs.docker.com/ee/ucp/kubernetes/storage/configure-aws-storage/
Basically, instances must have an IAM Role to create/attach/detach/delete volumes on their own.