I'm a beginner with Kubernetes. While reading a book I found that hostPath is not recommended as a volume type for production, because it creates a binding between the pod and the node. But if I avoid hostPath and use another volume type instead, will reading and writing files incur extra network I/O? Will there be an additional performance impact?
hostPath is, as the name suggests, reading and writing from a place on the host where the pod is running. If the host goes down, or the pod gets evicted or otherwise removed from the node, that data is (normally) lost. This is why the "binding" is mentioned -- the pod must stay on that same node, otherwise it will lose that data.
Using a provisioned volume type is better, as the disk can be reattached together with the pod on another node and you will not lose the data.
In terms of I/O, there would indeed be a minuscule difference, since you're no longer talking to the node's local disk but to a network-attached disk.
hostPath volumes are generally used for temporary files or storage that can be lost without impact to the pod, in much the same way you would use /tmp on a desktop machine.
To get a local volume you can use the local volume type, but you need a local volume provisioner that can allocate and recycle volumes for you.
Since local volumes are disks on the host, there are no performance trade-offs. But it is more common to use network-attached volumes provided by a cloud provider, and those do have a latency trade-off.
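For reference, this is roughly what a hostPath mount looks like in a Pod spec; the names, image, and path below are just placeholders:
```yaml
# Minimal sketch of a Pod using a hostPath volume.
# Whatever is written under /data/scratch lives on whichever node runs this Pod;
# if the Pod is rescheduled to another node, that data does not follow it.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-pod            # placeholder name
spec:
  containers:
    - name: app
      image: busybox:1.36      # placeholder image
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      hostPath:
        path: /data/scratch    # a directory on the node itself
        type: DirectoryOrCreate
```
With a provisioned volume you would instead reference a PersistentVolumeClaim in the volumes section, and the storage system, not the node, decides where the data actually lives.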
I currently have a single-node Kubernetes instance running on a VM. The disk attached to this VM is 100GB, but 'df -h' shows that the / partition only has 88GB available (other stuff is used for OS overhead etc...)
I have a kubernetes manifest that creates a 100GB local Persistent Volume.
I also have a pod creating a 100GB Persistent Volume Claim.
Both of these deploy and come up normally even though the entire VM does not even have 100GB available.
To make things even more complicated, the VM is thin provisioned, and only using 20 GB on the actual disk right now...
HOW IS THIS WORKING !?!?
The local provisioner does no size checks, nor is the size enforced anywhere. The final "volume" is just a bind mount, the same as with hostPath. The main reason local PVs exist is that the scheduler doesn't understand hostPath, so in a multi-node scenario it cannot restrict which node the pod lands on.
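To illustrate, a local PV and claim for that scenario look roughly like the sketch below (the node name and path are placeholders). The capacity field is only used to match the claim against the volume during binding; nothing compares it to the size of the underlying filesystem, which is why a 100Gi PV on an 88GB root partition binds without complaint:
```yaml
# Sketch of a local PV whose declared capacity exceeds the real disk.
# The 100Gi figure is only compared against PVC requests during binding;
# the resulting bind mount has however much space the backing filesystem has.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-100g
spec:
  capacity:
    storage: 100Gi             # declared, not enforced against the actual disk
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/data            # placeholder path on the node
  nodeAffinity:                # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - my-node      # placeholder node name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc-100g
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 100Gi           # binds because 100Gi fits within the PV's declared 100Gi
```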
According to the documentation:
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned ... It is a resource in the cluster just like a node is a cluster resource...
So I was reading about all currently available plugins for PVs, and I understand that for 3rd-party / out-of-cluster storage this doesn't matter (e.g. storing data in EBS, Azure or GCE disks), because there are few or no implications when adding or removing nodes from a cluster. However, there are different ones such as (ignoring hostPath, as that works only for single-node clusters):
csi
local
which (at least from what I've read in the docs) don't require 3rd-party vendors/software.
But also:
... local volumes are subject to the availability of the underlying node and are not suitable for all applications. If a node becomes unhealthy, then the local volume becomes inaccessible by the pod. The pod using this volume is unable to run. Applications using local volumes must be able to tolerate this reduced availability, as well as potential data loss, depending on the durability characteristics of the underlying disk.
The local PersistentVolume requires manual cleanup and deletion by the user if the external static provisioner is not used to manage the volume lifecycle.
Use-case
Let's say I have a single-node cluster with a single local PV and I want to add a new node to the cluster, so I have 2-node cluster (small numbers for simplicity).
Will the data from an already existing local PV be replicated 1:1 onto the new node, as in having one PV with 2 nodes of redundancy, or is it strictly bound to the existing node only?
If the already existing PV can't be adjusted from 1 to 2 nodes, can a new PV (created from scratch) be created so it's 1:1 replicated between 2+ nodes on the cluster?
Alternatively if not, what would be the correct approach without using a 3rd-party out-of-cluster solution? Will using csi cause any change to the overall approach or is it the same with redundancy, just different "engine" under the hood?
Can a new PV be created so it's 1:1 replicated between 2+ nodes on the cluster?
None of the standard volume types are replicated at all. If you can use a volume type that supports ReadWriteMany access (most readily NFS) then multiple pods can use it simultaneously, but you would have to run the matching NFS server.
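As a sketch, a statically created NFS volume with ReadWriteMany looks roughly like this; the server address and export path are placeholders, and the NFS server itself is something you have to provide:
```yaml
# Sketch of an NFS-backed PV/PVC that multiple pods can mount at the same time.
# Kubernetes does not replicate the data; durability is the NFS server's problem.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany            # mountable read-write by many nodes at once
  nfs:
    server: 192.168.1.50       # placeholder NFS server address
    path: /exports/shared      # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""         # bind against the statically created PV above
  resources:
    requests:
      storage: 10Gi
```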
Of the volume types you reference:
hostPath is a directory on the node the pod happens to be running on. It's not a directory on any specific node, so if the pod gets recreated on a different node, it will refer to the same directory but on the new node, presumably with different content. Aside from basic test scenarios I'm not sure when a hostPath PersistentVolume would be useful.
local is a directory on a specific node, or at least following a node-affinity constraint. Kubernetes knows that not all storage can be mounted on every node, so this automatically constrains the pod to run on the node that has the directory (assuming the node still exists).
csi is an extremely generic extension mechanism, so that you can run storage drivers that aren't on the list you link to. There are some features that might be better supported by the CSI version of a storage backend than the in-tree version. (I'm familiar with AWS: the EBS CSI driver supports snapshots and resizing; the EFS CSI driver can dynamically provision NFS directories.)
In the specific case of a local test cluster (say, using kind) using a local volume will constrain pods to run on the node that has the data, which is more robust than using a hostPath volume. It won't replicate the data, though, so if the node with the data is deleted, the data goes away with it.
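To make the csi point concrete, a StorageClass backed by the EBS CSI driver looks roughly like the sketch below; the class name and the gp3 parameter are just illustrative values:
```yaml
# Sketch of a StorageClass using the AWS EBS CSI driver instead of the in-tree plugin.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                  # illustrative name
provisioner: ebs.csi.aws.com     # the CSI driver (in-tree would be kubernetes.io/aws-ebs)
parameters:
  type: gp3                      # illustrative EBS volume type
allowVolumeExpansion: true       # resizing is one of the CSI-side conveniences
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```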
I am trying to monitor filesystem usage for pods in k8s. I am using Kubernetes (microk8s) with hostPath persistent volumes. I am running Kafka along with a number of producers to see what happens when I go past the PVC size limit, among other things. I have tried getting information from the API server, but it is not reported there. Since it is only using hostPath, that kind of makes sense; it is not a dynamic volume system. Doing df on the host just shows all of the volumes with the same utilization as the root filesystem. This is the same result when using exec -- df within the container. There are no pvcRefs on the containers according to the API server, which kind of explains why the dashboard doesn't have this information. Is this a dead end, or does someone have a way around this limitation? I am now wondering if the PVC limits will be enforced.
Since with hostPath your data is stored directly on the worker node, you won't be able to monitor the usage. Using hostPath has many drawbacks, and while it's good for testing, it should not be used for a production system. Keeping the data directly on the node is dangerous, and in the case of node failure/replacement you will lose it. Other disadvantages are:
Pods created from the same pod template may behave differently on different nodes because of different hostPath file/dir contents on those nodes
Files or directories created by a hostPath volume on the host are only writable by root. This means you either need to run your container process as root or modify the file permissions on the host to be writable by a non-root user, which may lead to security issues.
hostPath volumes should not be used with StatefulSets.
As you already found out, it would be a good idea to move on from hostPath towards something else.
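Since the workload here is Kafka, the usual replacement is a StatefulSet whose volumeClaimTemplates give each replica its own PVC from a real storage class. A rough sketch, with the image, size, and storage class name as placeholders (Kafka's own configuration is omitted):
```yaml
# Sketch: each replica gets its own PVC via volumeClaimTemplates,
# so usage is tracked per claim instead of blending into the node's root filesystem.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka               # assumes a matching headless Service exists
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: bitnami/kafka:3.7 # placeholder image; real Kafka config omitted
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard # placeholder storage class
        resources:
          requests:
            storage: 20Gi
```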
I'm teaching myself Kubernetes with a 5 Rpi cluster, and I'm a bit confused by the way Kubernetes treats Persistent Volumes with respect to Pod Scheduling.
I have 4 worker nodes using ext4 formatted 64GB micro SD cards. It's not going to give GCP or AWS a run for their money, but it's a side project.
Let's say I create a Persistent volume Claim requesting 10GB of storage on worker1, and I deploy a service which relies on this PVC, is that service then forced to be scheduled on worker1?
Should I be looking into distributed file systems like Ceph or Hdfs so that Pods aren't restricted to being scheduled on a particular node?
Sorry if this seems like a stupid question, I'm self taught and still trying to figure this stuff out! (Feel free to improve my tl;dr doc for kubernetes with a pull req)
Just some examples; as already mentioned, it depends on your storage system. As I can see, you use the local storage option.
Local Storage:
Yes, the pod needs to run on the same machine where the PV is located (your case).
iSCSI/Trident SAN:
No, the node where the pod is scheduled will mount the iSCSI block device.
(As mentioned already, volume binding mode is an important keyword; it's possible you need to set this to 'WaitForFirstConsumer'.)
NFS/Trident NAS:
No, it's NFS, mountable from anywhere as long as you can reach it and authenticate against it.
VMware VMDKs:
No, same as iSCSI: the node that gets the pod scheduled mounts the VMDK from the datastore.
Ceph/rook.io:
No, you get 3 options for storage: file, block, and object storage. Every type is distributed, so you can schedule a pod on any node.
Also, Ceph is an ideal system for running distributed, software-defined storage on commodity hardware. What I can recommend is https://rook.io/, basically open-source Ceph on 'container steroids'.
Let's say I create a Persistent volume Claim requesting 10GB of storage on worker1, and I deploy a service which relies on this PVC, is that service then forced to be scheduled on worker1?
This is a good question. How this works depends on your storage system. The StorageClass defined for your PersistentVolumeClaim contains information about the volume binding mode. It is common to use dynamically provisioned volumes, so that the volume is first allocated when a user/consumer/Pod is scheduled, and typically this volume does not exist on the local node but remotely in the same data center. Kubernetes also has support for Local Persistent Volumes, which are physical volumes located on the same node, but they are typically more expensive and are used when you need high disk performance and capacity.
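For local volumes specifically, the usual pattern is a StorageClass with no dynamic provisioner and WaitForFirstConsumer binding, so the claim is only bound once the scheduler has picked a node with a matching PV. A minimal sketch:
```yaml
# Sketch of a StorageClass for statically created local PVs.
# WaitForFirstConsumer delays binding until a Pod that uses the PVC is scheduled,
# so the scheduler can pick a node that actually holds a matching local volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning for local PVs
volumeBindingMode: WaitForFirstConsumer
```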
I have a question regarding what is the best approach with K8S in AWS.
The way I see it, either I use the EBS volume directly for the PV and PVC, or I mount the EBS volume as a regular folder in my EC2 instance and then use those mounted folders for my PV and PVC.
what approach is better in your opinion?
It is important to note that I want my K8s to be cloud agnostic, so maybe forcing an EBS configuration is less desirable than using a folder, so that the EC2 does not care what the origin of the folder is.
many thanks
what approach is better in your opinion?
Without question: using the PV and PVC. Half the reason will go here, and the other half below. By declaring those as managed resources, kubernetes will cheerfully take care of attaching the volumes to the Node it is scheduling the Pod upon, and detaching it from the Node when the Pod is unscheduled. That will matter in a huge way if a Node reboots, for example, because the attach-detach cycle will happen transparently, no Pager Duty involved. That will not be true if you are having to coordinate amongst your own instances who is alive and should have the volume attached at this moment.
It is important to note that I want my K8s to be cloud agnostic, so maybe forcing an EBS configuration is less desirable than using a folder, so that the EC2 does not care what the origin of the folder is.
It still will be cloud agnostic, because what you have told kubernetes -- declaratively, I'll point out, using just text in a yaml file -- is that you wish for some persistent storage to be volume mounted into your container(s) before they are launched. Only drilling down into the nitty gritty will surface the fact that it's provided by an AWS EBS volume. I would almost guarantee you could move those descriptors over to GKE (or Azure's thing) with about 90% of the text exactly the same.
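To make that concrete, a sketch like the one below (names, image, and size are placeholders) only ever references a claim; whether the cluster's default StorageClass is backed by EBS, a GCE persistent disk, or an Azure disk is invisible to the workload:
```yaml
# Sketch: the workload names a PVC, and the cluster's default StorageClass
# decides which cloud disk actually backs it, keeping the manifest portable.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi              # placeholder size; the default StorageClass is used
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: nginx:1.27      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: app-data
```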