Kubernetes 1.8.1: controlling the priority of pod eviction

We have a small problem with our Kubernetes cluster.
One of our applications is so demanding that it sometimes consumes all of our resources, and eventually some pods get killed. The real problem starts when system pods like flannel or the cache get evicted.
Is there a recommended way to control what gets evicted? How can we protect the system pods? Maybe someone has experience with this topic?
One idea is to change the QoS for all pods/apps in kube-system to "Guaranteed". But I'm afraid this will not work well if we set resource limits, even with a large margin.
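For reference, a pod lands in the Guaranteed QoS class only when every container sets requests equal to limits. A minimal sketch (names and values are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-system-pod   # hypothetical name
  namespace: kube-system
spec:
  containers:
  - name: app
    image: example/app:1.0   # hypothetical image
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"          # limits == requests => Guaranteed QoS
        memory: "256Mi"
```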
By the way, where can I read about what (default) resource requirements the system services have? How can we set them at cluster creation time?
The second idea is to set an eviction policy and/or taints and tolerations, but we worry that our key application would be one of the first to be (re)moved. Unfortunately, it currently runs as a single copy and its initialization can take up to several minutes, so moving it between nodes is currently unacceptable.
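If we went the taints route, the idea would be to taint a node that is reserved for the key application and tolerate that taint only in its pod spec, so nothing else is scheduled there. A sketch with hypothetical node, key, and image names:

```yaml
# First taint the dedicated node (hypothetical node name and key):
#   kubectl taint nodes worker-1 dedicated=key-app:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: key-app              # hypothetical name
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "key-app"
    effect: "NoSchedule"     # only this pod tolerates the taint
  containers:
  - name: key-app
    image: example/key-app:1.0   # hypothetical image
```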
The final idea is to use Priority and Preemption, but from what I see in the 1.8.1 documentation it is still in the alpha phase, and I have serious concerns about the stability of this solution.
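For completeness, in 1.8 the alpha feature works roughly like this: you create a PriorityClass (behind the PodPriority feature gate, using the v1alpha1 API) and reference it from the pod. A sketch with a hypothetical class name and value:

```yaml
# Alpha in 1.8: requires the PodPriority feature gate to be enabled.
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: critical-app         # hypothetical name
value: 1000000               # higher value = evicted/preempted later
globalDefault: false
description: "Priority class for the key application"
```

The pod then opts in with `priorityClassName: critical-app` in its spec.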
Maybe there is something else I did not think about? I would be happy to hear other proposals.

Related

Is performance testing multiple deployment stacks in one Kubernetes cluster a valid test?

We have a deployment stack with about 20 microservices/pods. Each deployment goes into its own namespace. To make sure that the CPU and memory are guaranteed for each pod and not shared, we set the request amounts equal to the limit amounts. Now we sometimes need to deploy more stacks into the same performance cluster, e.g. to test different releases of the same stack. The question is whether having more than one deployment stack in one cluster can invalidate the test results, due to the shared network or other reasons.
Initially we were thinking of creating one cluster per performance test to make sure it is isolated and the test results are correct, but creating and maintaining a new cluster is very costly. We also thought about pinning each deployment to its own node so that load testing one stack doesn't impact the others, but I'm not sure if that really helps. Please share your knowledge on this, as Kubernetes is fairly new to us.
If the containers are running on the same underlying hosts then bleedthrough is always possible. If you set all pods to Guaranteed QoS (i.e. requests == limits), it at least reduces the bleedthrough to a minimum. Running things on one cluster is always fine, but if you want to truly reduce the crosstalk to zero, you would need dedicated workload nodes for each stack.

Guessing Kubernetes limits for Kubernetes deployments

Is there any way to correctly estimate the resource limits we need for running deployments on Kubernetes clusters?
Yes, you can guess that a single-threaded application most likely won't need more than 1 CPU.
For any other program: no, there is no easy way to guess. Every application is different and reacts differently under different workloads.
The easiest way to figure out how many resources an application needs is to run it and measure.
Run some benchmarks/profilers and see how the application behaves, then make decisions based on that.
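Once you have measurements, you can encode them as requests (roughly what you observed under normal load) and limits (a headroom ceiling). A sketch with hypothetical values:

```yaml
# Fragment of a container spec; all numbers are hypothetical
# and should come from your own profiling.
resources:
  requests:
    cpu: "250m"       # typical usage observed under normal load
    memory: "512Mi"
  limits:
    cpu: "1"          # ceiling for observed peaks, with headroom
    memory: "1Gi"
```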

Why is enforcing system-reserved reservations in Kubernetes dangerous?

I'm reading the Kubernetes docs on Reserve Compute Resources for System Daemons, and it says "Be extra careful while enforcing system-reserved reservation since it can lead to critical system services being CPU starved, OOM killed, or unable to fork on the node."
I've seen this warning in a few places, and I'm having a hard time understanding the practical implication.
Can someone give me a scenario in which enforcing system-reserved reservation would lead to system services being starved, etc, that would NOT happen if I did not enforce it?
You probably have at least a few things running on the host nodes outside of Kubernetes' view. Like systemd, some hardware stuff, maybe sshd. Minimal OSes like CoreOS definitely have a lot less, but if you're running on a more stock OS image, you need to leave room for all the other gunk that comes with it. Without setting RAM aside, the kubelet will happily use it all up, and then when you try to SSH in to debug why your node has become slow and unstable, you won't be able to.
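Concretely, the reservation itself is a kubelet setting. A sketch in kubelet config-file syntax (in 1.8 the equivalents were the `--system-reserved` and `--enforce-node-allocatable` flags; the values here are hypothetical):

```yaml
# KubeletConfiguration fragment (hypothetical values)
systemReserved:
  cpu: "500m"
  memory: "1Gi"
enforceNodeAllocatable:
- pods                 # default: only the pods cgroup is capped
# - system-reserved    # the risky enforcement the docs warn about;
#                      # also requires systemReservedCgroup to be set,
#                      # and a too-small reservation will starve host daemons
```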

Running one pod per node with deterministic hostnames

I have what I believe is a simple goal, but I can't figure out how to get Kubernetes to play ball.
For my particular application, I am trying to deploy a number of replicas of a docker image that is a worker for another service. This system uses the hostname of the worker to distinguish between workers that are running at the same time.
I would like to be able to deploy a cluster where every node runs a worker for this service.
The problem is that the master also keeps track of every worker that ever worked for it, and displays these in a status dashboard. The intent is that you spin up a fixed number of workers by hand and leave it that way. I would like to be able to resize my cluster and have the number of workers change accordingly.
This seems like a perfect application for DaemonSet, except that then the hostnames are randomly generated and the master ends up tracking many orphaned hostnames.
An alternative might be StatefulSet, which gives us deterministic hostnames, but I can't find a way to force it to scale to one pod per node.
The system I am running is open source and I am looking into changing how it identifies workers to avoid this mess, but I was wondering if there was any sensible way to dynamically scale a StatefulSet to the number of nodes in the cluster. Or any way to achieve similar functionality.
One way is to use nodeSelector, but I totally agree with Markus: the more correct and advanced way is to use anti-affinity. This is a really powerful and at the same time simple solution to prevent pods with the same labels from being scheduled onto one node.
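A sketch of the anti-affinity approach on a StatefulSet: pods carrying the same label repel each other on the hostname topology, so at most one lands on each node (names and replica count are hypothetical; replicas still have to be scaled to match the node count):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker               # hypothetical name
spec:
  serviceName: worker
  replicas: 3                # scale manually to match the node count
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: worker
            topologyKey: kubernetes.io/hostname   # at most one pod per node
      containers:
      - name: worker
        image: example/worker:1.0   # hypothetical image
```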

How to automatically scale the number of pods based on load?

We have a service which is fairly idle most of the time, so it would be great if we could delete all the pods when the service has not received any requests for, say, 30 minutes; then, when the next request comes in, Kubernetes would create the first pod and process the response.
Is it possible to set the min pod instance count to 0?
I found that currently, Kubernetes does not support this, is there a way I can achieve this?
This is not supported in Kubernetes the way it is supported by web servers like nginx or apache, app servers like puma, passenger, gunicorn, or unicorn, or Google App Engine Standard, where workers can be started lazily and brought up the moment the first request comes in. The downside of that model is that your first requests will always be slower. (There may be some rationale behind Kubernetes pods not behaving this way; I can see it requiring significant design changes, or a new type of workload for this very specific case.)
If a pod is sitting idle it would not be consuming many resources. You could tweak your pod's resource values so that you request a small amount of CPU/memory and set the limit to a higher amount. The upside of having a pod always running is that, in theory, your first requests will never have to wait long for a response.
Yes. You can achieve that using the Horizontal Pod Autoscaler.
For an example, see the Horizontal Pod Autoscaler Walkthrough.
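A minimal HPA sketch (target name is hypothetical); note that `minReplicas` must be at least 1, which is why scale-to-zero as asked above is not possible with the HPA alone:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-service           # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service         # hypothetical deployment to scale
  minReplicas: 1             # the HPA cannot scale to 0 replicas
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60
```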