Cloud Composer 2: prevent eviction of worker pods - kubernetes

I am currently planning to upgrade our Cloud Composer environment from Composer 1 to 2. However, I am quite concerned about disruptions that could occur in Cloud Composer 2 due to the new autoscaling behavior inherited from GKE Autopilot. In particular, since nodes will now autoscale based on demand, it seems like nodes with running workers could be killed off if GKE thinks the workers could be rescheduled elsewhere. This would be bad because my code isn't currently very tolerant of retries.
I think that this can be prevented by adding the following annotation to the worker pods: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
However, I don't know how to add annotations to worker pods created by Composer (I'm not creating them myself, after all). How can I do that?
EDIT: I think this issue is made more complex by the fact that it should still be possible for the cluster to evict a pod once it's finished processing all its Airflow tasks. If the annotation is added but doesn't go away once the pod is finished processing, I'm worried that could prevent the cluster from ever scaling down.
So a more dynamic solution may be needed, perhaps one that takes into account the actual tasks that Airflow is processing.

If I have understood your problem well, could you please try this solution:
In the GCP Console, navigate to the Kubernetes Engine --> Workloads page for your Cloud Composer environment's cluster.
Find the worker pod you want to add the annotation to and click on the name of the pod.
On the pod details page, click on the Edit button.
In the Pod template section, find the Annotations field and click on the pencil icon to edit.
In the Edit annotations field, add the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
Click on the Save button to apply the change.
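For reference, this is roughly how the annotation should end up looking in the pod's metadata once applied (the name and image below are placeholders, not the real Composer worker spec):
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-example            # placeholder; Composer generates the real name
  annotations:
    # tells the cluster autoscaler it must not evict this pod when scaling down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: airflow-worker
      image: example-airflow-worker-image  # placeholder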
Let me know if it works fine. Good luck.

Related

reversing the order of pod recreation in Kubernetes Vertical Pod Autoscaler

I am trying to use VPA to autoscale my deployed services. Due to resource limitations in my cluster, I set the min_replica option to 1. The VPA workflow I have seen so far is that it first deletes the existing pod and then re-creates it. This approach causes downtime for my services. What I want is for the VPA to first create the new pod and then delete the old one, exactly like rolling updates for Deployments. Is there an option or hack to reverse the flow to the desired order in my case?
This can be achieved with a Python script or with an IaC pipeline: you can collect the metrics of the Kubernetes cluster and, whenever they exceed a certain threshold, trigger the Python code that creates a new pod with the required resources and shuts down the old one. Follow this github link for more info on the Python client for Kubernetes.
Ansible can also be used for this operation: trigger your Ansible playbook whenever the metric breaches a certain threshold, and specify the new sizes of the pods that need to be created (a rough sketch follows the note below). Follow this official Ansible document for more information. However, both of these procedures involve manual analysis to select the desired pod size, so if you don't want to use vertical scaling you can go for horizontal scaling instead.
Note: The information is gathered from official Ansible and github pages and the urls are referred to in the post.
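As a rough illustration of the Ansible route (a sketch only: the collection, deployment name, namespace and resource sizes are all assumptions you would replace), a playbook that reapplies a Deployment with new resource requests could look like this; since Deployments roll out changes by starting the new pod before removing the old one, this gives the create-then-delete order asked about:
# resize_deployment.yml -- illustrative sketch, not a drop-in solution
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Reapply the Deployment with larger resource requests
      kubernetes.core.k8s:            # assumes the kubernetes.core collection is installed
        state: present
        definition:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: my-service          # placeholder
            namespace: default        # placeholder
          spec:
            template:
              spec:
                containers:
                  - name: my-service  # must match the existing container name
                    resources:
                      requests:
                        cpu: "500m"       # placeholder sizes
                        memory: "512Mi"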

Prevent Kubernetes from schedulling everything to the same node

So I have 4 nodes currently, and Kubernetes, for some reason, decides to always schedule everything to the same node.
I'm not talking about replicas of the same deployment, so topologySpreadConstraints wouldn't apply there. In fact, when I scale up a deployment to several replicas, they get scheduled to different nodes. However, any new deployment and any new volume always go to the same node.
Affinity constraints also work, if I configure a pod to only schedule to a specific node (different from the usual one) it works fine. But anything else, goes to the same node. Is this considered normal? The node is at 90% utilization, and even when it crashes completely, Kubernetes happily schedules everything to it again.
Okay, so this was a very specific issue, and I'm not sure whether I actually resolved it, but it seems to be working now.
This was a cluster deployed on Hetzner and using the hetzner cloud controller manager. I had been removing and adding nodes to the cluster and, as it turns out, I forgot to add the flag --cloud-provider=external to this node's kubelet.
This issue is pretty well known. It was specifically showing as a "missing prefix" event, so I never thought it was related.
To solve it, adding the flag and restarting the kubelet was not enough for me, so I had to drain the node, remove it from the cluster, rebuild it from scratch and re-join it with the correct flag. This not only solved the "missing prefix" issue, but it also seems to have solved the scheduling issue, though I'm not sure why.
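In case it helps anyone else: if the node is joined with kubeadm (an assumption, your setup may differ), the flag can be baked into the join configuration so it isn't forgotten again, roughly like this:
# join-config.yaml -- sketch only, assuming a kubeadm-based node join
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external          # needed for the hetzner cloud controller manager
discovery:
  bootstrapToken:
    apiServerEndpoint: "<control-plane-endpoint>"   # placeholder
    token: "<bootstrap-token>"                      # placeholder
    unsafeSkipCAVerification: true                  # for brevity in this sketch only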

Is it possible to schedule a pod to run for say 24 hours and then remove deployment/statefulset? or need to use jobs?

We have a bunch of pods running in dev environment. The pods are auto-provisioned by an application on every business action. The problem is that across various namespaces they are accumulating and eating available resources in EKS.
Is there a way without jenkins/k8s jobs to simply put some parameter on the pod manifest to tell it to self destruct say in 24 hours?
Add to your pod.spec:
activeDeadlineSeconds: 86400
After the deadline, your Pod will be stopped for good with the status DeadlineExceeded.
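In context, that looks roughly like this (name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: short-lived-pod             # placeholder
spec:
  activeDeadlineSeconds: 86400      # the Pod is terminated 24 hours after it starts
  containers:
    - name: app
      image: example/app:latest     # placeholder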
If I understood your situation properly, you would like to scale your cluster down in order to save resources.
Kubernetes has the ability to autoscale your application in a cluster. In practice, this means that Kubernetes can start additional pods when the load is increasing and terminate excess pods when the load is decreasing.
It is possible to downscale the application to zero pods, but, in this case, you will have a delay serving the first request while the pod is starting.
This functionality relies on performance metrics. From the practical side, it means that autoscaling doesn't happen instantly, because it takes some time for the performance metrics to reach the configured threshold.
The Kubernetes feature in question, HPA (Horizontal Pod Autoscaler), is described in this document.
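As a quick illustration (names and thresholds are placeholders), an HPA that scales a Deployment on CPU usage could look like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa                  # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                    # placeholder
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale out when average CPU usage exceeds 70%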
In case you are running your cluster on GCP or GKE, you are able to go further and automatically start additional nodes for your cluster when you need more computing capacity and shut down nodes when they are not running application pods anymore.
More information about this functionality can be found following the link.
Last, but not least, you can use a tool like Ansible to manage all your Kubernetes assets (it can create/manage deployments via playbooks).
If you decide to give it a try, you might find this information useful:
Creating a Container cluster in GKE
70% cheaper Kubernetes cluster on AWS
How to build a Kubernetes Horizontal Pod Autoscaler using custom metrics

Ignite ReadinessProbe

Deploying an Ignite cluster within Kubernetes, I came across an issue that prevents cluster members from joining the group. If I use a readinessProbe and a livenessProbe, even with a delay as low as 10 seconds, the nodes never join each other. If I remove those probes, they find each other just fine.
So, my question is: can you use these probes to monitor node health, and if so, what are appropriate settings. On top of that, what would be good, fast health checks for Ignite, anyway?
Update:
After posting on the ignite mailing list, it looks like StatefulSets are the way to go. (Thanks Dmitry!)
I think I'm going to leave in the below logic to self-heal any segmentation issues although hopefully it won't be triggered often.
Original answer:
We are having the same issue and I think we have a workable solution. The Kubernetes discovery SPI only sees pods once they become ready endpoints of the service.
This means that if there are no ready pods at startup time, ignite instances all think that they are the first and create their own grid.
The cluster should be able to self heal if we have a deterministic way to fail pods if they aren't part of an 'authoritative' grid.
In order to do this, we keep a reference to the TcpDiscoveryKubernetesIpFinder and use it to periodically check the list of ignite pods.
If the instance is part of a cluster that doesn't contain the alphabetically first IP in the list, we know we have a segmented topology. Killing the pods that get into that state should cause them to come up again, look at the service list and join the correct topology.
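For reference, a stripped-down sketch of the StatefulSet + headless Service route mentioned in the update might look like the following; the image, namespace and ports below are the common Ignite defaults, but treat all of it as an assumption to adapt:
apiVersion: v1
kind: Service
metadata:
  name: ignite                      # the service TcpDiscoveryKubernetesIpFinder is pointed at
  namespace: ignite
spec:
  clusterIP: None                   # headless: discovery only needs the pod addresses
  selector:
    app: ignite
  ports:
    - name: discovery
      port: 47500
    - name: communication
      port: 47100
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ignite
  namespace: ignite
spec:
  serviceName: ignite
  replicas: 2
  selector:
    matchLabels:
      app: ignite
  template:
    metadata:
      labels:
        app: ignite
    spec:
      containers:
        - name: ignite
          image: apacheignite/ignite:2.14.0   # placeholder version
          ports:
            - containerPort: 47500
            - containerPort: 47100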
I am facing the same issue, using Ignite embedded within a Java spring application.
As you said, the readinessProbe on the Kubernetes Deployment's spec.template.spec.containers has the side effect of preventing the Kubernetes Pods from being listed as Endpoints of the related Kubernetes Service.
Trying without any readinessProbe, it does indeed seem to work better (the Ignite nodes all join the same Ignite cluster).
Yet this has the undesired side effect of exposing the Kubernetes Pods before they are ready, as Spring has not yet fully started ...

Monitoring and alerting on pod status or restart with Google Container Engine (GKE) and Stackdriver

Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?
While I can see CPU, memory and disk usage metrics for all pods in Stackdriver there seems to be no way of getting metrics about crashing pods or pods in a replica set being restarted due to crashes.
I'm using a Kubernetes replica set to manage the pods, so they are respawned and created with a new name when they crash. As far as I can tell, the metrics in Stackdriver appear by pod name (which only exists for the lifetime of the pod), which doesn't seem very sensible.
Alerting on pod failures seems like such a natural thing that it is hard to believe it isn't supported at the moment. The monitoring and alerting capabilities I get from Stackdriver for Google Container Engine, as they stand, seem rather useless, as they are all bound to pods whose lifetime can be very short.
So if this doesn't work out of the box are there known workarounds or best practices on how to monitor for continuously crashing pods?
You can achieve this manually with the following:
In the Logs Viewer, create the following filter:
resource.labels.project_id="<PROJECT_ID>"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="<NAMESPACE, or default>"
jsonPayload.message:"failed liveness probe"
Create a metric by clicking on the Create Metric button above the filter input and filling in the details.
You may now track this metric in Stackdriver.
Would be happy to be informed of a built-in metric instead of this.
There is a built-in metric now, so it's easy to build a dashboard and/or alert on it without setting up custom metrics:
Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
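For example, the corresponding Metrics Explorer / alerting filter is just:
metric.type="kubernetes.io/container/restart_count"
resource.type="k8s_container"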
In my cluster (a bare-metal k8s cluster), I use kube-state-metrics (https://github.com/kubernetes/kube-state-metrics) to do what you want. This project belongs to the Kubernetes repo and is quite easy to use. Once deployed, you can use the kube_pod_container_status_restarts metric to know if a container has restarted.
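If Prometheus is scraping kube-state-metrics, an alerting rule along these lines works; note that current kube-state-metrics versions expose the metric with a _total suffix, and the threshold here is just a placeholder:
groups:
  - name: pod-restarts
    rules:
      - alert: ContainerRestartingFrequently
        # fires when a container restarted more than 3 times within the last hour
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"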
Others have commented on how to do this with metrics, which is the right solution if you have a very large number of crashing pods.
An alternative approach is to treat crashing pods as discrete events or even log lines. You can do this with Robusta (disclaimer, I wrote this) with YAML like this:
triggers:
  - on_pod_update: {}
actions:
  - restart_loop_reporter:
      restart_reason: CrashLoopBackOff
  - image_pull_backoff_reporter:
      rate_limit: 3600
sinks:
  - slack
Here we're triggering an action named restart_loop_reporter whenever a pod updates. The data stream comes from the APIServer.
The restart_loop_reporter is an action which filters out non-crashing pods. Above it's configured to report only on CrashLoopBackOffs but you could remove that to report all crashes.
A benefit of doing it this way is that you can gather extra data about the crash automatically. For example, the above will fetch the pod's logs and forward them along with the crash report.
I'm sending the result here to Slack, but you could just as well send it to a structured output like Kafka (already builtin) or Stackdriver (not yet supported, but I can fix that if you like).
Remember that you can always raise a feature request if the available options are not enough.