Specify scheduling order of a Kubernetes DaemonSet - kubernetes

I have Consul running in my cluster and each node runs a consul-agent as a DaemonSet. I also have other DaemonSets that interact with Consul and therefore require a consul-agent to be running in order to communicate with the Consul servers.
My problem is, if my DaemonSet is started before the consul-agent, the application will error as it cannot connect to Consul and subsequently get restarted.
I also notice the same problem with other DaemonSets, e.g Weave, as it requires kube-proxy and kube-dns. If Weave is started first, it will constantly restart until the kube services are ready.
I know I could add retry logic to my application, but I was wondering if it was possible to specify the order in which DaemonSets are scheduled?

Kubernetes itself does not provide a way to specify dependencies between pods / deployments / services (e.g. "start pod A only if service B is available" or "start pod A after pod B").
The current approach (based on what I found while researching this) seems to be retry logic or an init container. To quote the docs:
They run to completion before any app Containers start, whereas app Containers run in parallel, so Init Containers provide an easy way to block or delay the startup of app Containers until some set of preconditions are met.
This means you can either add retry logic to your application (which I would recommend, as it might also help you in other situations such as a short service outage) or you can use an init container that polls a health endpoint via the Kubernetes service name until it gets a satisfying response.
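For illustration, such an init container for the Consul case might look like the following sketch. The image, the use of `status.hostIP`, and polling the local consul-agent's status endpoint on hostPort 8500 are assumptions to adapt to your setup:

```yaml
spec:
  initContainers:
  - name: wait-for-consul
    image: curlimages/curl:8.5.0   # any image with curl/sh works
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP   # node IP, where the consul-agent DaemonSet listens
    command:
    - sh
    - -c
    # Block until the local consul-agent answers; only then start the app containers.
    - |
      until curl -sf "http://${HOST_IP}:8500/v1/status/leader"; do
        echo "waiting for consul-agent"; sleep 2
      done
```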

Retry logic is preferred over startup dependency ordering, since it handles both the initial bringup case and recovery from post-start outages.

Related

How to tell Kubernetes to not reschedule a pod unless it dies?

Kubernetes tends to assume apps are small/lightweight/stateless microservices which can be stopped on one node and restarted on another node with no downtime.
We have a slow-starting (20 min) legacy (stateful) application which, once running as a set of pods, should not be rescheduled without due cause. The reason is that all user sessions will be killed and the users will have to log in again. There is NO way to serialize the sessions and externalize them. We want 3 instances of the pod.
Can we tell k8s not to move a pod unless absolutely necessary (i.e. it dies)?
Additional information:
The app is a tomcat/java monolith
Assume for the sake of argument we would like to run it in Kubernetes
We do have a liveness test endpoint available
There is no benefit in telling k8s to use only one pod. That is not the "spirit" of k8s. In this case, it might be better to use a dedicated machine for your app.
But you can assign a pod to a special node - Assigning Pods to Nodes. This should only be necessary when there are special hardware requirements (e.g. an AI microservice needs a GPU, which is only available on node xy).
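For reference, node assignment is just a `nodeSelector` in the pod spec; a minimal sketch (the label key/value and image name are made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: legacy-app
  template:
    metadata:
      labels:
        app: legacy-app
    spec:
      nodeSelector:
        hardware: gpu        # only schedule on nodes labeled hardware=gpu
      containers:
      - name: app
        image: legacy-app:1.0   # placeholder image
```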
k8s doesn't restart your pod for fun. It will restart it when there is a reason (node died, app died, ...) and I have never noticed a "random reschedule" in a cluster. It is hard to say without any further information (like deployment, logs, cluster) what exactly happened to you.
And for your comment: there are different types of recreation; one of them starts a fresh instance and kills the old one only once the startup was successful. Look here: Kubernetes deployment strategies
All points together:
Don't force your app onto a specific node - k8s will select a node "smartly".
There are normally no planned reschedules in k8s.
k8s will recreate pods only if there is a reason. Maybe your app didn't answer on the liveness endpoint? Or someone/something deleted your pod?

Web-Server running in an EKS cluster with spot-instances

I'm running a web-server deployment in an EKS cluster. The deployment is exposed behind a NodePort service, ingress resource, and AWS Load Balancer controller.
This deployment is configured to run on "always-on" nodes, using a Node Selector.
The EKS cluster runs additional auto-scaled workloads which can also use spot instances if needed (in the same namespace).
Since the NodePort service exposes a static port across all nodes in the cluster, there are many targets in the said target group, which are registered and de-registered whenever a node is added to or removed from the cluster.
What exactly happens if a request from the client is routed to the service on a node that is about to be scaled down?
I'm asking because I'm getting many 504 Gateway Timeouts from the ALB. Specifically, these requests do not reach our FE/BE pods and terminate at the ALB level.
Welcome to the community #gil-shelef!
Based on the AWS documentation, additional handlers should be used to add both resilience and cost savings.
Let's start with understanding how this works:
There is a specific node termination handler DaemonSet which runs a pod on each spot instance and listens for spot instance interruption notices. This makes it possible to gracefully terminate any running pods on that node, drain the node from the load balancer, and let the Kubernetes scheduler reschedule the removed pods on other instances.
The workflow (taken from the AWS documentation on Spot Instance Interruption Handling, which also includes an example) can be summarized as:
Identify that a Spot Instance is about to be interrupted in two minutes.
Use the two-minute notification window to gracefully prepare the node for termination.
Taint the node and cordon it off to prevent new pods from being placed on it.
Drain connections on the running pods.
Once pods are removed from endpoints, kube-proxy will trigger an update in iptables, which takes a little time. To make this smoother for end users, you should consider adding a pre-stop pause of about 5-10 seconds. You can find more information about how this happens and how to mitigate it in my answer here.
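A minimal sketch of such a pre-stop pause (the container name and image are placeholders; the 7-second value is an arbitrary choice within the suggested 5-10 second range):

```yaml
containers:
- name: web
  image: my-web-server:1.0   # placeholder image
  lifecycle:
    preStop:
      exec:
        # Give kube-proxy and the load balancer time to remove this pod
        # from rotation before the container receives SIGTERM.
        command: ["sleep", "7"]
```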
Also here are links for these handlers:
Node termination handler
Cluster autoscaler on AWS
For your last question, please check this AWS KB article on how to troubleshoot 504 errors in EKS.

Kubernetes. Can i execute task in POD from other POD?

I'm wondering if there is an option to execute a task from a pod. Say I have a pod which listens for requests and, on each request, delegates the task to another pod (a worker pod). This worker pod is not alive when there are no jobs to do, and if there is more than one job to do, more pods are created. After the job is done, the worker pods are stopped. So a worker pod lives for the duration of one task, is then killed, and when a new task arrives a new worker pod is started. I hope I described it properly. Do you know if this is possible in Kubernetes? The worker pod could be started by, for example, a REST call from the main pod.
There are a few ways to achieve this behavior.
Pure Kubernetes Way
This solution requires ServiceAccount configuration. Please take a look at the Kubernetes documentation Configure Service Accounts for Pods.
Your application service/pod can handle different custom tasks. Applying a specific service account to your pod in order to perform a specific task is best practice. In Kubernetes, using a service account with predefined RBAC rules allows you to handle this almost out of the box.
The main concept is to configure specific RBAC authorization rules for a specific service account, granting different permissions (get, list, watch, create) on different Kubernetes resources (pods, jobs).
In this scenario the working pod waits for incoming requests; after it receives a specific request, it can perform the corresponding task against the Kubernetes API.
This can be extended, e.g. by using a sidecar container inside your working pod. More details about the sidecar concept can be found in this article.
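As a sketch, a service account allowed to create and inspect Jobs could be set up like this (all names are placeholders; the verbs mirror the permissions mentioned above):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: job-spawner
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-spawner-role
rules:
# Allow the pod to create and watch Jobs...
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create"]
# ...and to observe the worker pods the Jobs spawn.
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: job-spawner-binding
subjects:
- kind: ServiceAccount
  name: job-spawner
roleRef:
  kind: Role
  name: job-spawner-role
  apiGroup: rbac.authorization.k8s.io
```

The main pod would then set `serviceAccountName: job-spawner` in its spec and create a Job object via the API server for each incoming request.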
Create custom controller
Another way to achieve your goal is to use Custom Controller.
The example presented in the Extending the Kubernetes Controller article shows how a custom controller watches the Kubernetes API in order to instrument an underlying worker pod (watching the Kubernetes configuration for any changes and then deleting the corresponding pods). In your setup, such a controller could watch your API for waiting/unprocessed requests and perform an additional task, like creating a Kubernetes Job inside the cluster.
Use an existing solution, like Job Processing Using a Work Queue:
RabbitMQ on Kubernetes
Coarse Parallel Processing Using a Work Queue
Kubernetes Message Queue
Keda

dual Kubernetes Readiness probes?

I have a scenario where it is required to 'prepare' Kubernetes for terminating/shutting down a container, but allow it to serve some requests until that happens.
For example, let's assume that there are three methods: StartAction, ProcessAction, EndAction. I want to prevent clients from invoking StartAction when a container is about to be shut down. However, they should still be able to use ProcessAction and EndAction on that same container (after all Actions have been completed, the container will shut down).
I was thinking that this is some sort of 'dual' readiness probe, where I basically want to indicate a 'not ready' status but continue to serve requests for already started Actions.
I know that there is a PreStop hook, but I am not confident that it serves this need because, according to the documentation, I suspect that during the PreStop the pod is already taken off the load balancer:
(simultaneous with 3) Pod is removed from endpoints list for service, and are no longer considered part of the set of running Pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
(https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
Assuming that I must rely on stickiness and must continue serving requests for Actions on containers where those actions were started, is there some recommended practice?
I think you can just implement 2 endpoints in your application:
Custom readiness probe
Shutdown preparation endpoint
So to make a graceful shutdown, I think you should first call the "Shutdown preparation endpoint", which causes the "Custom readiness probe" to return an error, so Kubernetes takes that pod out of the service load balancer (no new clients will come) but existing TCP connections are kept (existing clients can keep operating). Once you see from some custom metrics (which your service should provide) that all actions for clients are done, you shut down the containers using standard Kubernetes actions. All of these steps should probably be automated somehow using the Kubernetes and your application APIs.
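The readiness side of that could look like the following sketch (the image, the `/readyz` path, the port, and the grace period are all placeholders; `/readyz` is assumed to start failing once the shutdown-preparation endpoint has been called):

```yaml
containers:
- name: app
  image: my-app:1.0   # placeholder image
  readinessProbe:
    httpGet:
      path: /readyz   # returns 5xx after the shutdown-preparation endpoint is hit
      port: 8080
    periodSeconds: 5
    failureThreshold: 1   # drop out of the Service endpoints quickly
terminationGracePeriodSeconds: 600   # allow in-flight Actions to finish
```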

Starting a container/pod after running the istio-proxy

I am trying to implement a service mesh to a service with Kubernetes using Istio and Envoy. I was able to set up the service and istio-proxy but I am not able to control the order in which the container and istio-proxy are started.
My container is started first and tries to access an external resource via TCP, but at that time istio-proxy has not completely loaded, and neither has the ServiceEntry for the external resource.
I tried adding a panic in my service and also tried with a sleep of 5 seconds before accessing the external resource.
Is there a way that I can control the order of these?
On Istio 1.7.x and above you can set the configuration option values.global.proxy.holdApplicationUntilProxyStarts, which causes the sidecar injector to inject the sidecar at the start of the pod's container list and configures it to block the start of all other containers until the proxy is ready. This option is disabled by default.
According to https://istio.io/latest/news/releases/1.7.x/announcing-1.7/change-notes/
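The option can also be enabled per pod via the proxy config annotation rather than mesh-wide; a per-pod sketch (assuming Istio 1.7+, with a placeholder pod name and image):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-service   # placeholder
  annotations:
    # Ask the sidecar injector to start istio-proxy first and
    # block the app container until the proxy is ready.
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true
spec:
  containers:
  - name: app
    image: my-service:1.0   # placeholder image
```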
I don't think you can control the order other than listing the containers in a particular order in your pod spec. So I recommend you configure a readiness probe so that your pod is not ready until your service can send some traffic to the outside.
Github issue here:
Support startup dependencies between containers on the same Pod
We're currently recommending that developers solve this problem themselves by running a startup script on their application container which delays application startup until Envoy has received its initial configuration. However, this is a bit of a hack and requires changes to every one of the developer's containers.
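A sketch of that startup-script workaround (the image and app entrypoint are placeholders; polling the sidecar's readiness endpoint on localhost:15021 matches recent Istio versions, but older releases may expose a different port):

```yaml
containers:
- name: app
  image: my-service:1.0   # placeholder image
  command: ["sh", "-c"]
  args:
  - |
    # Block until the istio-proxy sidecar reports ready, then start the app.
    until curl -sf http://localhost:15021/healthz/ready; do
      echo "waiting for istio-proxy"; sleep 1
    done
    exec /app/server   # placeholder entrypoint
```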