Ignite ReadinessProbe - kubernetes

Deploying an ignite cluster within Kubernetes, I cam across an issue that prevents cluster members from joining the group. If I use a readinessProbe and a livenessProbe, even with a delay as low as 10 seconds, they nodes never join each other. If I remove those probes, they find each other just fine.
So, my question is: can you use these probes to monitor node health, and if so, what are appropriate settings. On top of that, what would be good, fast health checks for Ignite, anyway?

Update:
After posting on the ignite mailing list, it looks like StatefulSets are the way to go. (Thanks Dmitry!)
I think I'm going to leave in the below logic to self-heal any segmentation issues although hopefully it won't be triggered often.
Original answer:
We are having the same issue and I think we have a workable solution. The Kubernetes discovery spi lists services as they become ready.
This means that if there are no ready pods at startup time, ignite instances all think that they are the first and create their own grid.
The cluster should be able to self heal if we have a deterministic way to fail pods if they aren't part of an 'authoritative' grid.
In order to do this, we keep a reference to the TcpDiscoveryKubernetesIpFinder and use it to periodically check the list of ignite pods.
If the instance is part of a cluster that doesn't contain the alphabetical first ip in the list, we know we have a segmented topology. Killing the pods that get into that state should cause them to come up again, look at service list and join the correct topology.

I am facing the same issue, using Ignite embedded within a Java spring application.
As you said the readinessProbe: on the Kubernetes Deployment spec.template.spec.container has the side effect to prevent the Kubernetes Pods from being listed on the related Kubernetes Service as Endpoints
Trying without any readinessProbe, it seems to indeed works better (Ignite nodes are all joinging the same Ignite cluster)
Yet this have the undesired side effect of exposing the Kubernetes Pods when not yet ready, as Spring has not yet fully started ...

Related

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS DaemonSets itself (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way how I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information regarding scheduling, just some operative stuff.
I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus faster start time)
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then after the Pod has been scheduled and attempted to start, delete the Pod and scale up your Deployment
I would then expect the new Deployment managed Pod to arrive on that other Node because it by definition has less resources in use but also has the container image required
Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see the ImageLocality plugin assigns a very high score due to the Windows container images being really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate this.
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). Which surprised me, as that would seem to be a good default to me?
Some experimentation shows that the situation indeed changes, once we, for example, start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.

Scaling down video conference software in Kubernetes

I'm planning to deploy a WebRTC custom videoconference software (based on NodeJS, using websockets) with Kubernetes, but I have some doubts about scaling down this environment.
Actually, I'm planning to use cloud hosted Kubernetes (GKE, EKS, AKS or any) to be able to auto-scale nodes in the cluster to attend the demand increase and decrease. But, scaling up is not the problem, but it's about scaling down.
The cluster will scale down based on some CPU average usage metrics across the cluster, as I understand, and if it tries to remove some node, it will start to drain connections and stop receiving new connections, right? But now, imagine that there's a videoconference still running in this "pending deletion" node. There are two problems:
1 - Stopping the node before the videoconference finishes (it will drop the meeting)
2 - With the draining behaviour when it starts to scale down, it will stop receiving new connections, so if someone tries to join in this running video conference, it will receive a timeout, right?
So, which is the best strategy to scale down nodes for a video conference solution? Any ideas?
Thanks
I would say this is not a matter of resolving it on kubernetes level by some specific scaling strategy but rather application ability to handle such situations. It isn't even specific to kubernetes. Imagine that you deploy it directly on compute instances which are also subject to autoscale and you'll end up in exactly the same situation when the load decreases and one of the instances is removed from the set.
You should rather ask yourself if such application is suitable to be deployed as kubernetes workload. I can imagine that such videoconference session doesn't have to rely on the backend deployed on a single node only. You can even define some affinity or anti-affinity rules to prevent your Pods from being scheduled on the same node. So if the whole application cluster is still up and running (it's Pods are running on different nodes), eviction of a limited subset of Pods should not have a big impact.
You can actually face the same issue with any other application as vast majority of them base on some session which needs to be established between the client software and the server part. I would say it's application responsibility to be able to handle such scenarios. If some of the users unexpectedly loses the connection it should be possible to immediately redirect them to the running instance e.g. different Pod which is still able to accept new requests.
So basically if the application is designed to be highly available, scaling in (when we talk about horizontal scaling we actually talk about scaling in and scaling out) the underyling VMs, or more specifically kubernetes nodes, shouldn't affect it's high availability capabilities. From the other hand if it is not designed to be highly available, solution such as kubernetes probably won't help much.
There is no best strategy at your use case. When a cloud provider scales down, it is going to get one node randomly and kill it. It's not going to check whether this node has less resource consumption, so let's kill this one. It might end up killing the node with most pods running on it.
I would focus on how you want to schedule your pods. I would try to schedule them, if possible, on a node with running pods already (Pod inter-affinity), and would set up a Pod Disruption Budget to all Deployments/StatefulSets/etc (depending on how you want to run the pods). As a result it would only scale down when there are no pods running on a specific node, and it would kill that node, because on the other nodes there are pods; protected by a PDB.

Kubernetes best practices in pods

As I have been using kubernetes more I keep on seeing the reference that a pod can contain 1 container or more and I have even looked at examples.
My question is whether there is a case where this would be best practice and more efficient to create multi container pods since you can scale and replicate your pods coupling it with a service.
Thanks in advance
A Pod can contain multiple containers, but for the most portion of the situations, it makes perfect sense for the Pod to be simply an abstraction over a single running container.
In what situations does it make sense to have a multi-container deployed Pod?
What comes to my mind are the scenarios where you have a primary Pod running, but you need to tightly couple helper processes, such as a log watcher. In those situations, it makes perfect sense to actually have multiple containers running inside a single pod.
Another big example that comes to my mind is from the Istio project, which is a platform made to connect, manage and secure microservices and is generally referred as a Service Mesh.
A huge part of what it does and is able to accomplish to provide a greater control and customization over the deployed microservices network, is due to the fact that it deploys a sidecar proxy, denominated Envoy, throughout the environment intercepting all network communication between microservices.
Here, you can check an example of load balancing in a Istio service mesh. As you can see the Proxy is deployed inside the Pod, intercepting all communication that goes through it.

Can kubernetes schedule multiple unrelated pods on one host?

If I have 10 different services, each of which are independent from each other and run from their own container, can I get kubernetes to run all of those services on, say, 1 host?
This is unclear in the kubernetes documentation. It states that you can force it to schedule containers from the same pod onto one host, using a "multi-container pod", but it doesn't seem to approach the subject of whether you can have multiple pods running on one host.
In fact kubernetes will do exactly what you want by default. It is capable of running dozens if not hundreds of containers on a single host (depending on its specs).
If you want very advanced control over scheduling pods, there is an alpha feature for that, which introduces concept of node/pod (anti)affinities. But I would say it is a rather advanced k8s topic at the moment, so you are probably good with what is in stable/beta for most use cases.
Honorable mention: there is a nasty trick that allows you to control when pods can not be collocated on the same node. An that is when they both declare same hostPort in their ports section. It can be usefull for some cases, but be aware it affects ie. how rolling deployments happen in some situations.
You can use node selectors and assign the same node for each of the pod to the same node / host
http://kubernetes.io/docs/user-guide/node-selection/
Having said that, the whole point to Kubernetes is to manage a cluster where you can deploy apps / pods across them.

What happens when the Kubernetes master fails?

I've been trying to figure out what happens when the Kubernetes master fails in a cluster that only has one master. Do web requests still get routed to pods if this happens, or does the entire system just shut down?
According to the OpenShift 3 documentation, which is built on top of Kubernetes, (https://docs.openshift.com/enterprise/3.2/architecture/infrastructure_components/kubernetes_infrastructure.html), if a master fails, nodes continue to function properly, but the system looses its ability to manage pods. Is this the same for vanilla Kubernetes?
In typical setups, the master nodes run both the API and etcd and are either largely or fully responsible for managing the underlying cloud infrastructure. When they are offline or degraded, the API will be offline or degraded.
In the event that they, etcd, or the API are fully offline, the cluster ceases to be a cluster and is instead a bunch of ad-hoc nodes for this period. The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, etc. Until both:
Enough etcd instances are back online to form a quorum and make progress (for a visual explanation of how this works and what these terms mean, see this page).
At least one API server can service requests
In a partially degraded state, the API server may be able to respond to requests that only read data.
However, in any case, life for applications will continue as normal unless nodes are rebooted, or there is a dramatic failure of some sort during this time, because TCP/ UDP services, load balancers, DNS, the dashboard, etc. Should all continue to function for at least some time. Eventually, these things will all fail on different timescales. In single master setups or complete API failure, DNS failure will probably happen first as caches expire (on the order of minutes, though the exact timing is configurable, see the coredns cache plugin documentation). This is a good reason to consider a multi-master setup–DNS and service routing can continue to function indefinitely in a degraded state, even if etcd can no longer make progress.
There are actions that you could take as an operator which would accelerate failures, especially in a fully degraded state. For instance, rebooting a node would cause DNS queries and in fact probably all pod and service networking functionality until at least one master comes back online. Restarting DNS pods or kube-proxy would also be bad.
If you'd like to test this out yourself, I recommend kubeadm-dind-cluster, kind or, for more exotic setups, kubeadm on VMs or bare metal. Note: kubectl proxy will not work during API failure, as that routes traffic through the master(s).
Kubernetes cluster without a master is like a company running without a Manager.
No one else can instruct the workers(k8s components) other than the Manager(master node)(even you, the owner of the cluster, can only instruct the Manager)
Everything works as usual. Until the work is finished or something stopped them.(because the master node died after assigning the works)
As there is no Manager to re-assign any work for them, the workers will wait and wait until the Manager comes back.
The best practice is to assign multiple managers(master) to your cluster.
Although your data plane and running applications does not immediately starts breaking but there are several scenarios where cluster admins will wish they had multi-master setup. Key to understanding the impact would be understanding which all components talk to master for what and how and more importantly when will they fail if master fails.
Although your application pods running on data plane will not get immediately impacted but imagine a very possible scenario - your traffic suddenly surged and your horizontal pod autoscaler kicked in. The autoscaling would not work as Metrics Server collects resource metrics from Kubelets and exposes them in Kubernetes apiserver through Metrics API for use by Horizontal Pod Autoscaler and vertical pod autoscaler ( but your API server is already dead).If your pod memory shoots up because of high load then it will eventually lead to getting killed by k8s OOM killer. If any of the pods die, then since controller manager and scheduler talks to API Server to watch for current state of pods so they too will fail. In short a new pod will not be scheduled and your application may stop responding.
One thing to highlight is that Kubernetes system components communicate only with the API server. They don’t
talk to each other directly and so their functionality themselves could fail I guess. Unavailable master plane can mean several things - failure of any or all of these components - API server,etcd, kube scheduler, controller manager or worst the entire node had crashed.
If API server is unavailable - no one can use kubectl as generally all commands talk to API server ( meaning you cannot connect to cluster, cannot login into any pods to check anything on container file system. You will not be able to see application logs unless you have any additional centralized log management system).
If etcd database failed or got corrupted - your entire cluster state data is gone and the admins would want to restore it from backups as early as possible.
In short - a failed single master control plane although may not immediately impact traffic serving capability but cannot be relied on for serving your traffic.