Assigning a volume for log dispatching with K8s - kubernetes

I want to ship my application logs from my K8s pods to hosted ELK. I want to use a PVC. This link mentions that the pod should be owned by a StatefulSet. Is this a general recommendation or specific to DigitalOcean?

This is a general recommendation. The reason is that you have to ensure your logging pods are ready before the application pods can start. Additionally, say you had a Pod named logging-pod-1 sending the logs of application pod app-pod-1 to ELK, and logging-pod-1 crashes for some reason. Using a StatefulSet ensures that the new pod assumes the identity of logging-pod-1, so the "state" of the operation is maintained across failures, rescheduling, and restarts, ensuring that no logs are missed and logging isn't affected.
Here, "assuming the identity" means that when the new pod comes up, k8s knows which PVC to attach to it. This is the same PVC that the crashed pod was using, so the new pod simply "takes over the task" by mounting the same PV and resuming the work.
This is a very common design pattern and since you need to ensure ordering, a StatefulSet is a natural choice for deploying this logging functionality. You can learn more about StatefulSets here in the docs.
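To make the "same identity, same PVC" point concrete, here is a minimal Python sketch of the naming scheme a StatefulSet uses: pods are named <set name>-<ordinal>, and each PVC created from a volumeClaimTemplates entry is named <template name>-<pod name>. The set name logging-pod and the template name logs are illustrative assumptions, not taken from the question.

```python
from typing import List, Tuple


def statefulset_identities(set_name: str, claim_template: str,
                           replicas: int) -> List[Tuple[str, str]]:
    """Return (pod name, PVC name) pairs, one per ordinal.

    The names depend only on the ordinal, so a replacement pod for a given
    ordinal gets the same pod name and therefore the same PVC.
    """
    return [
        (f"{set_name}-{i}", f"{claim_template}-{set_name}-{i}")
        for i in range(replicas)
    ]


# Hypothetical names, for illustration only.
for pod, pvc in statefulset_identities("logging-pod", "logs", 2):
    print(pod, "->", pvc)
```

Because the PVC name is derived from the pod name, a replacement for logging-pod-1 is matched to logs-logging-pod-1 again without any extra bookkeeping.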

Related

Why not to use Kubernetes StatefulSet for stateless applications?

I know why to use a StatefulSet for stateful applications (e.g. a DB or something).
In most cases, I see advice like "You want to deploy a stateful app to k8s? Use StatefulSet!"
However, I have never seen advice like "You want to deploy a stateless app to k8s? Then DO NOT USE StatefulSet".
Even though nobody says "I don't recommend using StatefulSet for stateless apps", many stateless apps are deployed through a Deployment, as if it were the standard.
StatefulSet has clear pros for stateful apps, but I don't think Deployment has such clear pros for stateless apps.
Are there any pros of Deployment for stateless apps? Or are there any clear cons of StatefulSet for stateless apps?
I supposed that a StatefulSet cannot use a LoadBalancer Service, or that a StatefulSet carries a penalty when using HPA, but both assumptions are wrong.
I'm really curious about this question.
P.S. The precondition is that the stateless app also uses a PV, but does not persist stateful data; it stores logs, for example.
I googled "When not to use StatefulSet", "when Deployment is better than StatefulSet", "Why Deployment is used for stateless apps", and similar questions.
I also read the k8s docs about StatefulSet.
Different Priorities
What happens when a Node becomes unreachable in a cluster?
Deployment - Stateless apps
You want to maximize availability. As soon as Kubernetes detects that there are fewer than the desired number of replicas running in your cluster, the controllers spawn new replicas of it. Since these apps are stateless, it is very easy to do for the Kubernetes controllers.
StatefulSet - Stateful apps
You want to maximize availability - but now you must also ensure data consistency (the state). To ensure data consistency, each replica has its own unique ID, and there are never multiple replicas with the same ID. This means that you cannot spawn a new replica unless you are sure that the replica on the unreachable Node is terminated (i.e. it has stopped using the Persistent Volume).
Conclusion
Both Deployment and StatefulSet try to maximize the availability - but StatefulSet cannot sacrifice data consistency (e.g. your state), so it cannot act as fast as Deployment (stateless) apps can.
These priorities apply not only when a Node becomes unreachable, but at all times, e.g. also during upgrades and deployments.
In contrast to a Kubernetes Deployment, where pods are easily replaceable, each pod in a StatefulSet is given a name and treated individually. Pods with distinct identities are necessary for stateful applications.
This implies that if any pod perishes, it will be apparent right away. StatefulSets act as controllers but do not generate ReplicaSets; rather, they generate pods with distinctive names that follow a predefined pattern. The ordinal index appears in the DNS name of a pod. A distinct persistent volume claim (PVC) is created for each pod, and each replica in a StatefulSet has its own state.
For instance, a StatefulSet with four replicas generates four pods, each of which has its own volume, i.e. four PVCs. StatefulSets require a headless service to return the IPs of the associated pods and enable direct interaction with them. The headless service has a DNS name but no cluster IP of its own, and has to be created separately. The major components of a StatefulSet are the set itself, the persistent volume and the headless service.
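As a rough illustration of how the ordinal shows up in DNS, the sketch below builds the per-pod DNS names that a headless service provides. The names web and nginx, the default namespace, and the cluster.local cluster domain are assumptions for the example; the cluster domain in particular can be configured differently.

```python
from typing import List


def pod_dns_names(set_name: str, service: str, replicas: int,
                  namespace: str = "default",
                  cluster_domain: str = "cluster.local") -> List[str]:
    """Build <pod>.<headless service>.<namespace>.svc.<cluster domain> names."""
    return [
        f"{set_name}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


# Hypothetical StatefulSet "web" governed by a headless service "nginx".
for name in pod_dns_names("web", "nginx", 2):
    print(name)
```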
That all being said, people deploy Stateful Applications with Deployments, usually they mount a RWX PV into the pods so all "frontends" share the same backend. Quite common in CNCF projects.
A StatefulSet manages each Pod with a unique hostname based on an index number. With the index it is easy to identify the individual Pods, which helps applications that rely on unique network identities. Also, as you might have read, the Pods of a StatefulSet get deleted in a specified order to maintain consistency.
When you use a StatefulSet for a stateless application, the unique network identities and ordering guarantees become a burden to manage and add complexity.
For example, when you scale a StatefulSet down to zero, the Pods are terminated in a controlled order, which is not the case with a Deployment or ReplicaSet. However, there is no such ordering guarantee when the StatefulSet resource itself is deleted.
Also, before a scaling operation is applied to a StatefulSet Pod, all of its predecessors must be Running and Ready. So if you deploy an application with three Pods, they are created in the order app-0, app-1, app-2: app-1 won't be deployed before app-0 is Running and Ready, and app-2 won't be deployed until app-1 is Ready.
With a Deployment you can tune the percentages used in a RollingUpdate, whereas a StatefulSet deletes and recreates its Pods one by one.
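The ordering rules described above can be sketched as a small toy model in Python. It only imitates the default ordered behaviour (the OrderedReady pod management policy) and is not the controller's actual code; the function names are made up for this example.

```python
from typing import List, Optional, Set


def next_pod_to_create(set_name: str, desired: int,
                       existing: Set[int], ready: Set[int]) -> Optional[str]:
    """Return the next pod to create, or None if the controller must wait.

    Walks ordinals 0..desired-1 in order: a missing pod is created only if
    every pod before it already exists and is Ready.
    """
    for ordinal in range(desired):
        if ordinal not in existing:
            return f"{set_name}-{ordinal}"
        if ordinal not in ready:
            return None  # predecessor exists but is not Ready yet: wait
    return None  # all desired pods exist and are Ready


def scale_down_order(set_name: str, current: int, target: int) -> List[str]:
    """Pods are removed from the highest ordinal downwards."""
    return [f"{set_name}-{i}" for i in range(current - 1, target - 1, -1)]


print(next_pod_to_create("app", 3, existing={0}, ready=set()))  # None
print(next_pod_to_create("app", 3, existing={0}, ready={0}))    # app-1
print(scale_down_order("app", 3, 0))  # ['app-2', 'app-1', 'app-0']
```

A Deployment has no equivalent of this gate, which is part of why it can replace stateless pods faster.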

How to tell Kubernetes to not reschedule a pod unless it dies?

Kubernetes tends to assume apps are small/lightweight/stateless microservices which can be stopped on one node and restarted on another node with no downtime.
We have a slow-starting (20 min) legacy (stateful) application which, once running as a set of pods, should not be rescheduled without due cause. The reason is that all user sessions will be killed and the users will have to log in again. There is NO way to serialize the sessions and externalize them. We want 3 instances of the pod.
Can we tell k8s not to move a pod unless absolutely necessary (i.e. it dies)?
Additional information:
The app is a tomcat/java monolith
Assume for the sake of argument we would like to run it in Kubernetes
We do have a liveness test endpoint available
There is no benefit, if you tell k8s to use only one pod. That is not the "spirit" of k8s. In this case, it might be better to use a dedicated machine for your app.
But you can assign a pod to a specific node - see Assigning Pods to Nodes. That should only be necessary when there are special hardware requirements (e.g. an AI microservice needs a GPU, which is only available on node xy).
k8s doesn't restart your pod for fun. It will restart it when there is a reason (node died, app died, ...), and I have never noticed a "random reschedule" in a cluster. It is hard to say what exactly happened to you without further information (like the deployment, logs, cluster).
And for your comment: there are different types of recreation; one of them starts a fresh instance and kills the old one only once the startup was successful. Look here: Kubernetes deployment strategies
All points together:
Don't force your app onto a specific node - k8s will select the node "smartly".
There are normally no planned reschedules in k8s.
k8s will recreate pods only if there is a reason. Maybe your app didn't answer on the liveness endpoint? Or someone/something deleted your pod?

New PVC for an active pod

Is it possible to plug and play storage to an active pod without restarting the pod? I want to bind a new storage to a running pod without restarting the pod. Does Kubernetes support this?
Most things in a Pod are immutable. In particular if you look at the API definition of a PodSpec it says in part (emphasis mine)
containers: List of containers belonging to the pod. Containers cannot currently be added or removed. There must be at least one container in a Pod. Cannot be updated.
Typically you don't directly work with Pods; you work with a higher-level controller like a Deployment. There you can edit these things, and it reacts by creating new Pods with the new pod spec and then deleting the old Pods.
Also remember that sometimes the cluster itself will delete or restart a Pod (if its Node is over capacity or fails, for example) and you don't have any control over this. It's better to plan for your Pods to be periodically restarted than to try to prevent it.

How to delete pods inside "default" namespace permanently. As when I delete only pods it is coming back because of "replication controller"

"How to permanently delete pods inside "default" namespace? As when I delete pods, they are coming back because of "replication controller".
As this is in the default namespace, I am sure that we can delete it permanently. Any idea how to do it?
I'd like to add to what was already said in the previous answer.
Basically in kubernetes you have several abstraction layers. As you can read in the documentation:
A Pod is the basic execution unit of a Kubernetes application–the
smallest and simplest unit in the Kubernetes object model that you
create or deploy. A Pod represents processes running on your Cluster.
It is rarely deployed as a separate entity. In most cases it is part of a higher-level object such as a Deployment or ReplicationController. I would advise you to familiarize yourself with the general concept of controllers, especially Deployments, as they are currently the recommended way of setting up replication [source]:
Note: A Deployment that configures a ReplicaSet is now the recommended
way to set up replication.
As you can read further:
A ReplicationController ensures that a specified number of pod
replicas are running at any one time. In other words, a
ReplicationController makes sure that a pod or a homogeneous set of
pods is always up and available.
This also applies when certain pods are deleted by a user. A ReplicationController doesn't care why the pods were deleted. Its role is just to make sure they are always up and running. It's a very simple concept. When you don't want certain pods to exist any more, you must get rid of the higher-level object that manages them and ensures they are always available.
Read about Replication Controllers, then delete the ReplicationController.
It can't "ensure that a specified number of pod replicas are running" when it's dead.
kubectl delete replicationcontroller <name>
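Why the pods keep coming back can be illustrated with a toy reconcile loop in plain Python. This is only a sketch of the idea, not the Kubernetes API; the class name and the generated pod names are made up for the example.

```python
from typing import List


class ToyReplicationController:
    """Toy model: keeps the pod list at exactly `replicas` entries."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self._counter = 0  # used only to generate fresh pod names

    def reconcile(self, pods: List[str]) -> List[str]:
        pods = list(pods)
        # Too few pods: create replacements, regardless of why they are gone.
        while len(pods) < self.replicas:
            self._counter += 1
            pods.append(f"pod-{self._counter}")
        # Too many pods: drop the extras.
        del pods[self.replicas:]
        return pods


rc = ToyReplicationController(replicas=2)
pods = rc.reconcile([])     # ['pod-1', 'pod-2']
pods.remove("pod-1")        # the user deletes a pod by hand
pods = rc.reconcile(pods)   # ['pod-2', 'pod-3']: a replacement appears
print(pods)
```

As long as something keeps calling reconcile, deleting individual pods has no lasting effect; removing the controller is what stops the replacements.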

Steps of creating and deleting a pod

I'm studying the main components of kubernetes.
I was momentarily stuck regarding the concept of creating (deleting) a pod. In many charts or figures the pods are depicted inside the worker nodes and for this reason I was convinced that they were objects created directly in the worker node.
Digging deeper into this concept, I came across some pages that instead see the pod as a simple placeholder in the API server.
In this reference link it is said that in the first point the pod is created and in the fourth point that the pod is associated with the node from the API server.
In this reference link it is said that "a new Pod object is created on API server but is not bound to any node."
In this reference link it is said that "A Pod has one container that is a placeholder generated by the Kubernetes API"
All this makes me think that a pod is not actually created in a worker node.
Could someone give me an explanation to clarify this idea for me?
Simply speaking, the process of running a pod is the following:
The user makes an API request to create a pod in a namespace.
The API server validates the request, making sure that the user has the necessary authorization to create a pod in the given namespace and that the request conforms to the PodSpec.
If the request is valid, the API server creates an API object of kind "Pod" in its etcd database.
Kube-scheduler watches Pods and sees that there is a new Pod object. It then evaluates the Pod's resources, affinity rules, nodeSelectors, tolerations and so on, and finally decides which node the pod should run on. If no nodes are available due to lack of resources or other constraints, the Pod remains in state Pending. Kube-scheduler periodically retries scheduling decisions for Pending pods.
After the Pod is scheduled to a node, kube-scheduler passes the job to the kubelet on the selected node.
The kubelet is then responsible for actually starting the pod.
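The steps above can be sketched as a toy model in Python, which shows the point of the question: the Pod first exists only as an object with no node assigned, and only later runs on a node. Everything here is a simplification under stated assumptions: the real scheduler filters and scores nodes on many more criteria than CPU, and the names Pod, schedule, and kubelet_sync are made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pod:
    name: str
    cpu_request: int                 # millicores, illustrative only
    node_name: Optional[str] = None  # unset until the scheduler binds the pod
    phase: str = "Pending"


def schedule(pod: Pod, free_cpu: Dict[str, int]) -> bool:
    """Bind the pod to the first node with enough free CPU, if any."""
    for node, free in free_cpu.items():
        if free >= pod.cpu_request:
            pod.node_name = node
            free_cpu[node] -= pod.cpu_request
            return True
    return False  # no node fits: the pod stays Pending and unbound


def kubelet_sync(node: str, pods: List[Pod]) -> None:
    """The kubelet on `node` starts the pods that were bound to it."""
    for pod in pods:
        if pod.node_name == node and pod.phase == "Pending":
            pod.phase = "Running"


pod = Pod("web", cpu_request=500)         # object created, bound to no node
nodes = {"node-a": 200, "node-b": 1000}   # hypothetical free CPU per node
schedule(pod, nodes)                      # binds to node-b
kubelet_sync("node-b", [pod])             # node-b's kubelet starts it
print(pod)
```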
There is a good explanation by @Vasily Angapov about the creation and scheduling of a pod, but I think it is important to also add some context about what Pods and containers actually are - if you would like to read more about it, you can find good additional information here.
In essence, Pods are created and later on scheduled. So they are not created on a worker node, but they run on a node, and they are considered easily replaceable rather than durable entities. Therefore, whenever something happens to them that results in their termination or deletion, they might be started again on a different node, for the reasons mentioned in Vasily's answer.
Some more information here.