Steps of creating and deleting a pod - kubernetes

I'm studying the main components of kubernetes.
I was momentarily stuck on the concept of creating (and deleting) a pod. In many charts and figures, pods are depicted inside the worker nodes, so I was convinced that they were objects created directly on the worker node.
Digging deeper into this concept, I came across some pages that instead describe the pod as a simple placeholder in the API server.
In this reference link it is said that in the first step the pod is created, and in the fourth step the pod is associated with a node by the API server.
In this reference link it is said that "a new Pod object is created on API server but is not bound to any node."
In this reference link it is said that "A Pod has one container that is a placeholder generated by the Kubernetes API"
All this makes me think that a pod is not actually created in a worker node.
Could someone give me an explanation to clarify this idea for me?

Simply speaking, the process of running a pod is the following:
The user makes an API request to create a pod in a namespace.
The API server validates the request, making sure that the user has the necessary authorization to create pods in the given namespace and that the request conforms to the PodSpec.
If the request is valid, the API server creates an API object of kind "Pod" in its etcd database.
Kube-scheduler watches Pods and sees that there is a new Pod object not yet bound to any node. It then evaluates the Pod's resource requests, affinity rules, nodeSelectors, tolerations and so on, and finally makes a decision on which node the pod should run. If no node is available due to lack of resources or other constraints, the Pod remains in state Pending. Kube-scheduler periodically retries scheduling decisions for Pending pods.
After the Pod is scheduled, kube-scheduler does not contact the node directly; it records its decision by binding the Pod to the node in the API server. The kubelet on the selected node, which watches the API server, notices that the Pod has been assigned to it.
The kubelet is then responsible for actually starting the pod's containers.
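For illustration, the API object created in these steps can be as small as this minimal Pod manifest (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod        # placeholder name
  namespace: default
spec:
  containers:
  - name: app
    image: nginx:1.25      # any image you have access to
```

Right after creation, `kubectl get pod example-pod -o jsonpath='{.spec.nodeName}'` prints nothing, because the Pod object exists in the API server but is not yet bound to any node; once the scheduler binds it, the same command prints the chosen node.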

There is a good explanation by Vasily Angapov about the creation and scheduling of a pod, but I think it is important to also add some context about what Pods and containers actually are - if you would like to read more about it, you can find good additional information here.
In essence, Pods are created first and scheduled later. So they are not created on a worker node, but they do run on a node, and they are considered easily replaceable rather than durable entities. Therefore, whenever something happens to them that results in their termination or deletion, they might be started again on a different node, for the reasons mentioned in Vasily's answer.
Some more information here.

Related

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS DaemonSets itself (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way how I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information regarding scheduling, just some operative stuff.
I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus achieve a faster start time).
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then, after the Pod has been scheduled and attempted to start, delete the Pod and scale up your Deployment.
I would then expect the new Deployment-managed Pod to arrive on that other Node, because it by definition has fewer resources in use but also has the required container image.
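A sketch of that throwaway image-pull Pod, with a hypothetical node name and image (setting nodeName: directly bypasses kube-scheduler entirely):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-prepull
spec:
  nodeName: aks-win-node-2                       # hypothetical node; skips the scheduler
  containers:
  - name: prepull
    image: myregistry.azurecr.io/my-app:1.0      # hypothetical image from the Deployment
```

Once the image has been pulled onto that node, delete this Pod and scale the Deployment back up.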
Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see the ImageLocality plugin assigns a very high score due to the Windows container images being really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate this.
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). That surprised me, as it would seem to be a good default.
Some experimentation shows that the situation indeed changes, once we, for example, start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.
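As a sketch, both mitigations could be added to the Deployment's pod template like this (image, label, and request sizes are placeholders):

```yaml
spec:
  containers:
  - name: app
    image: my-windows-app:1.0        # placeholder image
    resources:
      requests:                      # non-zero requests let NodeResourcesFit
        cpu: "250m"                  # counterbalance ImageLocality
        memory: "256Mi"
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app                  # must match the pod template labels
```

With ScheduleAnyway the constraint is a soft preference; use DoNotSchedule to make it mandatory.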

Assigning a volume for log dispatching with K8s

I want to ship my application logs from my K8s pods to hosted ELK. I want to use a PVC. This link mentions that the pod should be owned by a StatefulSet. Is this a general recommendation, or specific to DigitalOcean?
This is a general recommendation. The reason is that you have to ensure your logging pods are ready before the application pods can start. Additionally, say you had a Pod named logging-pod-1 sending logs of application pod app-pod-1 to ELK, and logging-pod-1 crashes for some reason. Using a StatefulSet ensures that the replacement pod assumes the identity of logging-pod-1. This way the "state" of the operation is maintained across failures, rescheduling, and restarts, ensuring that no logs are missed and logging isn't affected.
This "assuming the identity" means, for example, that when the new pod comes up, Kubernetes knows which PVC to attach to it. This is the same PVC that the crashed pod was using, so the new pod simply "takes over the task" by mounting the same PV and continuing its work.
This is a very common design pattern and since you need to ensure ordering, a StatefulSet is a natural choice for deploying this logging functionality. You can learn more about StatefulSets here in the docs.
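A minimal sketch of such a StatefulSet, assuming a Filebeat-style shipper image and a small buffer volume (names and sizes are placeholders). The volumeClaimTemplates section is what makes each replacement pod re-mount its predecessor's PVC:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: log-shipper                  # hypothetical name
spec:
  serviceName: log-shipper           # headless Service giving pods stable identities
  replicas: 2
  selector:
    matchLabels:
      app: log-shipper
  template:
    metadata:
      labels:
        app: log-shipper
    spec:
      containers:
      - name: shipper
        image: docker.elastic.co/beats/filebeat:8.12.0   # example log shipper
        volumeMounts:
        - name: buffer
          mountPath: /var/lib/buffer
  volumeClaimTemplates:              # one PVC per ordinal, reattached on restart
  - metadata:
      name: buffer
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

If log-shipper-1 crashes, its replacement is also named log-shipper-1 and mounts the PVC buffer-log-shipper-1, so buffered logs survive the restart.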

Persistent Kafka transacton-id across restarts on Kubernetes

I am trying to achieve the exactly-once delivery on Kafka using Spring-Kafka on Kubernetes.
As far as I understood, the transactional-ID must be set on the producer and it should be the same across restarts, as stated here https://stackoverflow.com/a/52304789/3467733.
The problem arises using this semantic on Kubernetes. How can you get a consistent ID?
To solve this problem I implemented a Spring Boot application, let's call it "Replicas counter", that checks, through the Kubernetes API, how many pods there are with the same name as the caller, so I have a counter for every pod replica.
For example, suppose I want to deploy a Pod, let's call it APP-1.
This app does the following:
It performs a GET to the Replicas-Counter, passing the pod name as a parameter.
The replicas-counter calls the Kubernetes API in order to check how many pods there are with that pod name, adds 1, and returns the result to the caller. I also need to count not-ready pods (think about a first deploy: they couldn't get an ID if I wasn't checking for not-ready pods).
The APP-1 gets the id and will use it as the transactional-id
But, as you can see a problem could arise when performing rolling updates, for example:
Suppose we have 3 pods:
At the beginning we have:
app-1: transactional-id-1
app-2: transactional-id-2
app-3: transactional-id-3
So, during a rolling update we would have:
old-app-1: transactional-id-1
old-app-2: transactional-id-2
old-app-3: transactional-id-3
new-app-3: transactional-id-4 (Not ready, waiting to be ready)
New-app-3 becomes ready, so Kubernetes brings down old-app-3 and the rolling update continues:
old-app-1: transactional-id-1
old-app-2: transactional-id-2
new-app-3: transactional-id-4
new-app-2: transactional-id-4 (Not ready, waiting to be ready)
As you can see now I have 2 pods with the same transactional-id.
As far as I understood, these IDs have to be the same across restarts and unique.
How can I implement something that gives me consistent IDs? Is there someone that have dealt with this problem?
The problem with these IDs exists only for Kubernetes Deployments, not for StatefulSets, as those have a stable identifier as part of the name. I don't want to convert all Deployments to StatefulSets to solve this problem, as I think that is not the correct way to handle this scenario.
The only way to guarantee the uniqueness of Pods is to use a StatefulSet.
A StatefulSet keeps the desired number of replicas alive, but every time a pod dies it is replaced by a pod with the same name and configuration. That stable identity is what prevents the ID collisions and data loss described above.
The Service backing a StatefulSet must be headless: since each pod is unique, you need certain traffic to reach certain pods.
Every pod requires a PVC (in order to store data, and to recreate its state from that data whenever the pod is deleted).
Here is a great article describing why StatefulSet should be used in similar case.
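Building on that, the stable pod name a StatefulSet assigns (app-0, app-1, ...) can be injected into the container and used as the transactional id. A sketch of the pod template fragment, with placeholder names:

```yaml
spec:
  containers:
  - name: app
    image: my-spring-app:1.0          # placeholder image
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name    # e.g. "app-0", stable across restarts
    # the application can then derive the transactional id from POD_NAME,
    # e.g. via Spring Kafka's transaction-id-prefix setting
```

Because the ordinal-based name is reused on every restart, the transactional id stays both unique across replicas and consistent across restarts, which is exactly what Kafka's exactly-once semantics require.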

Is the container where a liveness or readiness probe's config is set a "pod check" container?

I'm following this task Configure Liveness, Readiness and Startup Probes
and it's unclear to me whether the container where the check is made is a container used only to check the availability of a pod. Because it would make sense: if the pod-check container fails, the API won't let any traffic in to the pod.
So must a health check signal come from the container where some image or app runs? (sorry, another question)
From the link you provided, it seems they are speaking about Containers and not Pods, so the probes are defined per container. When all containers are ready, the pod is described as ready too, as written in the doc you provided:
The kubelet uses readiness probes to know when a Container is ready to
start accepting traffic. A Pod is considered ready when all of its
Containers are ready. One use of this signal is to control which Pods
are used as backends for Services. When a Pod is not ready, it is
removed from Service load balancers.
So yes, every container that runs some image or app is supposed to expose such a health check.
Liveness and readiness probes, as described by Ko2r, are additional checks inside your containers, verified by the kubelet according to the settings for the particular probe:
If the command (defined by health-check) succeeds, it returns 0, and the kubelet considers the Container to be alive and healthy. If the command returns a non-zero value, the kubelet kills the Container and restarts it.
In addition:
The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.
From another point of view:
Pod is a top-level resource in the Kubernetes REST API.
As per docs:
Pods are ephemeral. They are not designed to run forever, and when a Pod is terminated it cannot be brought back. In general, Pods do not disappear until they are deleted by a user or by a controller.
Information about controllers can be found here:
So the best practice is to use controllers like those described above. You'll rarely create individual Pods directly in Kubernetes - even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a Controller), it is scheduled to run on a Node in your cluster. The Pod remains on that Node until the process is terminated, the Pod object is deleted, the Pod is evicted for lack of resources, or the Node fails.
Note:
Restarting a container in a Pod should not be confused with restarting the Pod. The Pod itself does not run, but is an environment the containers run in and persists until it is deleted
Because Pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up). Users should be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete. When a user requests deletion of a Pod, the system records the intended grace period before the Pod is allowed to be forcefully killed, and a TERM signal is sent to the main process in each container. Once the grace period has expired, the KILL signal is sent to those processes, and the Pod is then deleted from the API server. If the Kubelet or the container manager is restarted while waiting for processes to terminate, the termination will be retried with the full grace period.
The Kubernetes API server validates and configures data for the api objects which include pods, services, replicationcontrollers, and others. The API Server services REST operations and provides the frontend to the cluster’s shared state through which all other components interact.
For example, when you use the Kubernetes API to create a Deployment, you provide a new desired state for the system. The Kubernetes Control Plane records that object creation, and carries out your instructions by starting the required applications and scheduling them to cluster nodes–thus making the cluster’s actual state match the desired state.
Here you can find information about processing pod termination.
There are different kinds of probes:
For example, for an HTTP probe:
even if your app isn't an HTTP server, you can create a lightweight HTTP server inside your app to respond to the liveness probe.
Command
For command probes, Kubernetes runs a command inside your container. If the command returns with exit code 0, the container is marked as healthy.
More about probes and best practices.
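As a small sketch, an exec (command) liveness probe and an HTTP readiness probe on an ordinary application container could look like this (image, path, and port are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probed-app
spec:
  containers:
  - name: app
    image: my-app:1.0                      # placeholder image
    ports:
    - containerPort: 8080
    livenessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]   # exit code 0 means healthy
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready                       # hypothetical endpoint in the app
        port: 8080
      periodSeconds: 5
```

Note that both probes run against the same application container; there is no separate "pod check" container.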
Hope this helps.

How to know how long it takes to create a pod in Kubernetes? Is there any command, please?

I have my Kubernetes cluster and I need to know how long it takes to create a pod. Is there any Kubernetes command that shows me that?
Thanks in advance.
There is no ready-made command for what you are asking.
I think you should first understand the Pod Overview.
A Pod is the basic building block of Kubernetes–the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod represents a running process on your cluster.
A Pod encapsulates an application container (or, in some cases, multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run. A Pod represents a unit of deployment: a single instance of an application in Kubernetes, which might consist of either a single container or a small number of containers that are tightly coupled and that share resources.
While a Pod is being deployed, it goes through several phases:
Pending
The Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created. This includes time before being scheduled as well as time spent downloading images over the network, which could take a while.
Running
The Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting.
Succeeded
All Containers in the Pod have terminated in success, and will not be restarted.
Failed
All Containers in the Pod have terminated, and at least one Container has terminated in failure. That is, the Container either exited with non-zero status or was terminated by the system.
Unknown
For some reason the state of the Pod could not be obtained, typically due to an error in communicating with the host of the Pod.
As for Pod Conditions, each condition has a type, which can have one of the following values:
PodScheduled: the Pod has been scheduled to a node;
Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services;
Initialized: all init containers have started successfully;
Unschedulable: the scheduler cannot schedule the Pod right now, for example due to lacking of resources or other constraints;
ContainersReady: all containers in the Pod are ready.
Please refer to the documentation regarding Pod Lifecycle for more information.
When you are deploying your Pod, you have to consider how many containers will be running in it.
The images will have to be downloaded, and depending on their size this might take longer. Also, the default pull policy is IfNotPresent, which means that Kubernetes will skip the image pull if the image already exists on the node.
You can find more about Updating Images here.
You also need to consider how many resources your master and nodes have.
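Although there is no single built-in command, you can approximate how long a pod took to start by comparing the lastTransitionTime of the conditions listed above, for example in the output of kubectl get pod <name> -o yaml. An illustrative status fragment (timestamps are made up):

```yaml
status:
  conditions:
  - type: PodScheduled
    status: "True"
    lastTransitionTime: "2024-01-01T10:00:00Z"   # illustrative timestamps
  - type: Initialized
    status: "True"
    lastTransitionTime: "2024-01-01T10:00:01Z"
  - type: ContainersReady
    status: "True"
    lastTransitionTime: "2024-01-01T10:00:07Z"
  - type: Ready
    status: "True"
    lastTransitionTime: "2024-01-01T10:00:07Z"
```

Here the gap between PodScheduled and Ready (about 7 seconds in this made-up example) is a rough startup duration; kubectl describe pod <name> additionally shows timestamped events for scheduling, image pulls, and container starts.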