What is the concept of Service Affinity in OpenShift? - kubernetes

Situation
When a deployment fails on our OpenShift 3.11 instance because of a Failed Scheduling error event, a message similar to the following is shown:
Failed Scheduling 0/11 nodes are available: 10 CheckServiceAffinity, 2 ExistingPodsAntiAffinityRulesNotMatch, 2 MatchInterPodAffinity, 5 MatchNodeSelector.
In the above error message, the term CheckServiceAffinity is used. While it's easy to find articles on Pod Affinity or Anti-Affinity, I couldn't find a detailed description of Service Affinity.
Question
What is Service Affinity?
Is it a concept of Kubernetes or is it exclusive to OpenShift?

ServiceAffinity places pods on nodes based on the service that the pod belongs to. Placing pods of the same service on the same or co-located nodes can lead to higher efficiency.
It's a concept of OpenShift and not of open source Kubernetes.
https://docs.openshift.com/container-platform/3.9/admin_guide/scheduling/scheduler.html#configurable-predicates
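A rough sketch of how the serviceAffinity predicate argument is declared in a scheduler policy, based on the configurable-predicates page linked above (the predicate name and label are illustrative; OpenShift 3.x stores this policy as JSON, shown here in YAML form for readability):

# Sketch of a scheduler policy entry using the serviceAffinity predicate.
# "ZoneAffinity" is an illustrative name, not a required one.
kind: Policy
apiVersion: v1
predicates:
  - name: ZoneAffinity
    argument:
      serviceAffinity:
        # Pods belonging to the same service are only scheduled onto nodes
        # whose values for these labels match the node of the first pod placed.
        labels:
          - zone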

Related

Web-Server running in an EKS cluster with spot-instances

I'm running a web-server deployment in an EKS cluster. The deployment is exposed behind a NodePort service, ingress resource, and AWS Load Balancer controller.
This deployment is configured to run on "always-on" nodes, using a Node Selector.
The EKS cluster runs additional auto-scaled workloads which can also use spot instances if needed (in the same namespace).
Since the NodePort service exposes a static port across all nodes in the cluster, there are many targets in the target group, which are registered and de-registered whenever a node is added to or removed from the cluster.
What exactly happens if a request from the client is routed to the service on a node that is about to be scaled down?
I'm asking since I'm getting many 504 Gateway Timeouts from the ALB. Specifically, these requests do not reach our FE/BE pods and terminate at the ALB level.
Welcome to the community #gil-shelef!
Based on the AWS documentation, additional handlers should be used to add both resilience and cost savings.
Let's start with understanding how this works:
There is a specific node termination handler DaemonSet which runs a pod on each spot instance and listens for the spot instance interruption notification. This makes it possible to gracefully terminate any running pods on that node, drain the node from the load balancer, and let the Kubernetes scheduler reschedule the removed pods on different instances.
The workflow looks like the following (taken from the AWS documentation on Spot Instance Interruption Handling; that page also has an example). It can be summarized as:
Identify that a Spot Instance is about to be interrupted in two minutes.
Use the two-minute notification window to gracefully prepare the node for termination.
Taint the node and cordon it off to prevent new pods from being placed on it.
Drain connections on the running pods.
Once pods are removed from the endpoints, kube-proxy will trigger an update in iptables, which takes a little bit of time. To make this smoother for end users, you should consider adding a preStop pause of about 5-10 seconds, as in the sketch below. You can find more information about how this happens and how to mitigate it in my answer here.
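A minimal sketch of such a preStop pause, assuming the container image ships a shell with the sleep command (names and image are illustrative):

# Illustrative Deployment fragment: pause in preStop so in-flight requests
# drain while the endpoint/iptables updates propagate.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server                        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      terminationGracePeriodSeconds: 40   # must exceed the preStop sleep
      containers:
        - name: web
          image: example.com/web-server:latest   # placeholder image
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]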
Also here are links for these handlers:
Node termination handler
Cluster autoscaler on AWS
For your last question, please check this AWS knowledge base article on how to troubleshoot 504 errors in EKS.

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kubernetes and would like to ask about some concepts related to Kubernetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replicas.
(1)
Assume that there are 4 nodes, each being a different physical server with different CPU and memory.
When the deployment is made, how would Kubernetes assign the pods to the nodes? Will there be a scenario where it puts multiple pods on the same server while another server has no pod assigned (due to resource considerations)?
(2)
Assume there are 4 nodes (on 4 identical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would Kubernetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one has more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application's point of view is to set appropriate resource requests in your pod specifications. As long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
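As a rough sketch (the values and names here are illustrative, not recommendations), resource requests are set per container in the pod spec:

# Illustrative pod spec fragment: the scheduler only places this pod on a
# node with at least this much unreserved CPU and memory.
apiVersion: v1
kind: Pod
metadata:
  name: example-app                       # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:latest       # placeholder image
      resources:
        requests:
          cpu: "500m"                     # half a CPU core
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"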
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency, based on the resource availability of the nodes. So if a node does not have enough capacity to host a pod, it's possible that more than one replica of a pod is scheduled onto the same node.
Kubernetes will reschedule pods from the failed node to other available nodes which have enough capacity to host them. In this process, again, if there aren't enough nodes that can host the replicas, more than one replica may end up scheduled on the same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler with node and pod affinity and anti-affinity, as in the sketch below.
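A minimal sketch of a pod anti-affinity rule that asks the scheduler to spread replicas across nodes (all names are illustrative; preferredDuringScheduling makes this a soft preference rather than a hard requirement):

# Illustrative Deployment fragment: prefer not to co-locate replicas of the
# same app on a single node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                       # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: example-app
                topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: example.com/app:latest   # placeholder image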

What's the difference between "Pods" and "Static Pods" in Kubernetes, and when to choose "static pods" over regular pods?

I'm new to Kubernetes and looking to understand best practice around using different Kubernetes objects, and I have difficulty understanding the major difference between "Pods" and "Static Pods", functionally, if there is any.
Major questions as below:
Question 1: What is the major functional difference between "Pods" and "Static Pods", if there is any?
Question 2: When to choose "static pods" over regular pods?
Static pods are pods created and managed by the kubelet daemon on a specific node without the API server observing them. If a static pod crashes, the kubelet restarts it. The control plane is not involved in the lifecycle of a static pod. The kubelet also tries to create a mirror pod on the Kubernetes API server for each static pod so that the static pods are visible, i.e., when you run kubectl get pod, for example, the mirror object of the static pod is also listed.
You almost never have to deal with static pods. Static pods are usually used by software bootstrapping Kubernetes itself. For example, kubeadm uses static pods to bring up Kubernetes control plane components like the api-server and controller-manager as static pods. The kubelet can watch a directory on the host file system (configured using the --pod-manifest-path argument to kubelet) or sync pod manifests from a web URL periodically (configured using the --manifest-url argument to kubelet). When kubeadm is bringing up the Kubernetes control plane, it generates pod manifests for the api-server and controller-manager in a directory which the kubelet is monitoring. Then the kubelet brings up these control plane components as static pods.
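As a small sketch, a static pod is just an ordinary pod manifest dropped into the kubelet's manifest directory (the path and pod below are illustrative; kubeadm's real control plane manifests are more involved):

# Illustrative static pod manifest, e.g. saved as
# /etc/kubernetes/manifests/static-web.yaml on the node.
# The kubelet on that node creates and restarts this pod on its own;
# the API server only sees a read-only mirror pod.
apiVersion: v1
kind: Pod
metadata:
  name: static-web                        # hypothetical name
  labels:
    role: static-example
spec:
  containers:
    - name: web
      image: nginx:1.25                   # placeholder image
      ports:
        - containerPort: 80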
One of the use cases of static pods is Kubernetes control plane bootstrapping. Kubeadm, while bootstrapping a Kubernetes cluster, creates the API server, controller manager, and kube-scheduler as static pods, because it cannot create them as normal pods: the kube API server itself is not available yet.
As I understand from the official documentation link https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/
A "Pod"'s life cycle is handled by the Kubernetes control plane (from the cluster master).
A "Static pod"'s life cycle is controlled by the kubelet running on the given worker node where it is created, and not by the cluster master.
Reading more, I found the below in the official docs:
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#static-pods
This answers my Question 2:
"When to use static pods" >> static pods are useful in cluster bootstrapping cases. There is also a note stating that static Pods may be deprecated in the future.

Kubernetes Deployment with Zero Down Time

As a learner of Kubernetes concepts, how they work, and deployment with them, I have a couple of cases which I don't know how to achieve. I am looking for advice or some guidelines on how to achieve them.
I am using the Google Cloud Platform. The current running flow is described below. A push to the google source repository triggers Cloud Build which creates a docker image and pushes the image to the running cluster nodes.
Case 1: I want that when new pods are up and running, traffic is routed to the new pods, and old pods are killed only after each completes its running requests. Zero downtime is what I'm looking to achieve.
Case 2: What will happen if the disk usage of a running pod reaches 100%, or, in the Debian case, the inode count reaches full capacity? Will Kubernetes create new pods to manage this?
Case 3: How to manage pod to database connection limits?
Like the other answer says, use liveness and readiness probes. Basically, when a new pod is added to the Service pool it will only serve traffic after the readiness probe has passed. The old pod is removed from the Service pool, then drained and then terminated. This happens in a rolling fashion, one pod at a time.
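As a minimal sketch (the endpoint path, port, and surge/unavailable settings are illustrative assumptions, not values from the question):

# Illustrative Deployment fragment combining a rolling update strategy with
# readiness/liveness probes, so traffic only reaches pods that report ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                               # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                         # bring up one new pod before removing an old one
      maxUnavailable: 0                   # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:latest   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz              # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20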
This really depends on the capacity of your cluster and the ability to schedule pods depending on the limits for the containers in them. For more about setting up limits for containers, refer to here. In terms of the inode limit, if you reach it on a node, the kubelet won't be able to run any more pods on that node. The kubelet eviction manager also has a mechanism where it evicts the pods using the most inodes. You can also configure your eviction thresholds on the kubelet.
This would be more of a limitation at the OS level combined with your stateful application's configuration. You can keep this configuration in a ConfigMap. For example, for something like MySQL, the option would be max_connections.
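A minimal sketch of keeping such a setting in a ConfigMap (the names, value, and file key are illustrative assumptions; how the database picks up the file depends on its image and how you mount the ConfigMap):

# Illustrative ConfigMap holding a MySQL connection-limit override; it could
# be mounted into the database pod as an extra configuration file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config                      # hypothetical name
data:
  connection-limits.cnf: |
    [mysqld]
    max_connections = 250                 # example value, tune for your workload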
I can answer case 1 since I've done it myself.
Use Deployments with readinessProbes & livenessProbes.

DaemonSets on Google Container Engine (Kubernetes)

I have a Google Container Engine cluster with 21 nodes, there is one pod in particular that I need to always be running on a node with a static IP address (for outbound purposes).
Kubernetes supports DaemonSets
This is a way to have a pod deployed to a specific node (or to a set of nodes) by giving the node a label that matches the nodeSelector in the DaemonSet. You can then assign a static IP to the VM instance that the labeled node is on. However, GKE doesn't appear to support the DaemonSet kind.
$ kubectl create -f go-daemonset.json
error validating "go-daemonset.json": error validating data: the server could not find the requested resource; if you choose to ignore these errors, turn validation off with --validate=false
$ kubectl create -f go-daemonset.json --validate=false
unable to recognize "go-daemonset.json": no kind named "DaemonSet" is registered in versions ["" "v1"]
When will this functionality be supported and what are the workarounds?
If you only want to run the pod on a single node, you actually don't want to use a DaemonSet. DaemonSets are designed for running a pod on every node, not a single specific node.
To run a pod on a specific node, you can use a nodeSelector in the pod specification, as documented in the Node Selection example in the docs.
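A minimal sketch of pinning a pod to a labeled node with nodeSelector (the label key/value and pod name are illustrative; the node with the static IP would first be labeled, e.g. with kubectl label nodes <node-name> static-ip=true):

# Illustrative pod spec: only schedulable onto nodes carrying the
# static-ip=true label (added to the VM that has the static IP).
apiVersion: v1
kind: Pod
metadata:
  name: outbound-worker                   # hypothetical name
spec:
  nodeSelector:
    static-ip: "true"
  containers:
    - name: worker
      image: example.com/worker:latest    # placeholder image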
edit: But for anyone reading this that does want to run something on every node in GKE, there are two things I can say:
First, DaemonSet will be enabled in GKE in version 1.2, which is planned for March. It isn't enabled in GKE in version 1.1 because it wasn't considered stable enough at the time 1.1 was cut.
Second, if you want to run something on every node before 1.2 is out, we recommend creating a replication controller with a number of replicas greater than your number of nodes and asking for a hostPort in the container spec. The hostPort will ensure that no more than one pod from the RC will be run per node.
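A rough sketch of that pre-1.2 workaround (names, port, and replica count are illustrative; the hostPort conflict is what limits the RC to one pod per node):

# Illustrative ReplicationController: with more replicas than nodes and a
# fixed hostPort, at most one pod can land on each node; the surplus
# replicas simply stay Pending.
apiVersion: v1
kind: ReplicationController
metadata:
  name: per-node-agent                    # hypothetical name
spec:
  replicas: 30                            # set higher than the number of nodes (21 here)
  selector:
    app: per-node-agent
  template:
    metadata:
      labels:
        app: per-node-agent
    spec:
      containers:
        - name: agent
          image: example.com/agent:latest # placeholder image
          ports:
            - containerPort: 9100
              hostPort: 9100              # only one pod per node can bind this host port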
DaemonSets are still an alpha feature, and Google Container Engine supports only production Kubernetes features. Workaround: build your own Kubernetes cluster (GCE, AWS, bare metal, ...) and enable alpha/beta features.