Duplicate metrics with multiple instances of kube-state-metrics - kubernetes

Problem:
Duplicate data when querying Prometheus for metrics from kube-state-metrics.
Sample query and result with 3 instances of kube-state-metrics running:
Query:
kube_pod_container_resource_requests_cpu_cores{namespace="ns-dummy"}
Metrics:
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.142:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.142:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.17:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.35.17:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.37.171:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-34-25.ec2.internal",pod="app1-appname-6bd9d8d978-gfk7f",service="prom-kube-state-metrics"}
1
kube_pod_container_resource_requests_cpu_cores{container="appname",endpoint="http",instance="172.232.37.171:8080",job="kube-state-metrics",namespace="ns-dummy",node="ip-172-232-35-22.ec2.internal",pod="app2-appname-ccbdfc7c8-g9x6s",service="prom-kube-state-metrics"}
Observation:
Every metric comes up N times when N kube-state-metrics pods are running; as the sample shows, the duplicate series differ only in their instance label. With a single pod running, we get the correct data.
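(A query-side workaround, assuming the duplicates really do differ only in the instance label as above, is to aggregate that label away:

    max without (instance) (kube_pod_container_resource_requests_cpu_cores{namespace="ns-dummy"})

This collapses the N identical series into one, but it treats the symptom rather than the cause.)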
Possible solutions:
Scale down to single instance of kube-state-metrics. (Reduced availability is a concern)
Enable sharding. (Solves duplication problem, still less available)
According to the docs, horizontal scaling requires passing sharding arguments to the pods.
Shards are zero-indexed, so each pod has to be given its own shard index plus the total number of shards.
We are using the Helm chart, and it deploys kube-state-metrics as a Deployment.
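For reference, one documented way around the "different arguments per pod" problem is kube-state-metrics' automated sharding mode: deploy it as a StatefulSet and let each pod derive its shard index from its pod name's ordinal via the --pod and --pod-namespace flags (the manual equivalents are --shard and --total-shards). A minimal sketch, with an illustrative image tag rather than anything taken from the chart:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: kube-state-metrics
    spec:
      serviceName: kube-state-metrics
      replicas: 3                # total number of shards
      selector:
        matchLabels:
          app: kube-state-metrics
      template:
        metadata:
          labels:
            app: kube-state-metrics
        spec:
          containers:
            - name: kube-state-metrics
              image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0   # illustrative tag
              args:
                # each pod derives its shard index from its StatefulSet
                # ordinal (kube-state-metrics-0, -1, -2); the manual
                # equivalents are --shard=<index> and --total-shards=<N>
                - --pod=$(POD_NAME)
                - --pod-namespace=$(POD_NAMESPACE)
              env:
                - name: POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: POD_NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace

Newer versions of the Helm chart expose an autosharding option intended to wire this up for you; it is worth checking whether your chart version supports it before hand-rolling the StatefulSet.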
Questions:
How can we pass different arguments to different pods in this scenario, if that is possible at all?
Should we be worried about the availability of kube-state-metrics, considering the self-healing nature of Kubernetes workloads?
When should we really scale it out to multiple instances, and how?

You could use a 'self-healing' Deployment with only a single replica of kube-state-metrics: if the container goes down, the Deployment starts a new one. A short outage only affects you if your cluster is very large and produces many object changes per second.
kube-state-metrics is not focused on the health of the individual Kubernetes components, but rather on the health of the various objects inside the cluster, such as Deployments, nodes, and Pods.
For a small cluster there is no problem running it this way, but if you really need a highly available monitoring platform, I recommend you take a look at these two articles:
creating a well designed and highly available monitoring stack for kubernetes and
kubernetes monitoring

Related

Kubernetes - Multiple pods per node vs one pod per node

What is usually preferred in Kubernetes - having a one pod per node configuration, or multiple pods per node?
From a performance standpoint, what are the benefits of having multiple pods per node, if there is an overhead in having multiple pods living on the same node?
From a performance standpoint, wouldn't it be better to have a single pod per node?
The answer to your question is heavily dependent on your workload.
There are very specific scenarios (machine learning, big data, GPU-intensive tasks) where you might have a one-pod-per-node configuration due to an IO or hardware requirement for a single pod. However, this is normally not an efficient use of resources and eliminates a lot of the benefits of containerization.
The benefit of multiple pods per node is a more efficient use of all available resources. Generally speaking, managed Kubernetes clusters will schedule and manage the number of pods that run on a node for you automatically, and many providers offer simple autoscaling solutions to ensure that you are always able to run all your workloads.
Running only a single pod per node has its cons as well. Each node needs its own "support" pods, such as metrics, logging, and network agents plus other system pods, so with one application pod per node those resources will most likely never be fully utilized. In terms of performance, this means that choosing the right node-size-to-pod-count ratio can give you the same performance as a single pod per node at a lower cost.
On the contrary, running too many pods on a massive node can starve those components of resources and cause gaps in metrics or logs, lost packets, OOM errors, and so on.
Finally, when we also consider autoscaling, scaling up a couple more pods on existing nodes is a lot more responsive than bringing up a new node for each pod.
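To make the node-size-to-pod-count ratio concrete: the scheduler packs pods onto nodes by their resource requests, not their actual usage, so the ratio is something you control per workload. A sketch with purely illustrative numbers:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app                  # illustrative name
    spec:
      containers:
        - name: app
          image: nginx:1.25      # illustrative image
          resources:
            requests:            # what the scheduler bin-packs by
              cpu: 250m          # a 4-vCPU node fits roughly 15 such pods,
              memory: 256Mi      # leaving headroom for system agents
            limits:              # runtime ceilings
              cpu: 500m
              memory: 512Mi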

kubernetes - use-case to choose deployment over daemon-set

Normally when we scale up an application we do not deploy more than one pod of the same service on the same node. With a DaemonSet we can make sure our service runs on every node, which makes pods very easy to manage as nodes are scaled up and down. If I use a Deployment instead, scaling gets troublesome: there may be multiple pods on the same node, while a new node may have no pod at all.
I want to know the use cases where a Deployment is more suitable than a DaemonSet.
Your cluster runs dozens of services, and therefore hundreds of nodes, but for scale and reliability you only need a couple of copies of each service. Deployments make more sense here; if you ran everything as DaemonSets you'd have to be able to fit the entire stack onto every single node, and you wouldn't be able to scale components independently.
I would almost always pick a Deployment over a DaemonSet, unless I was running some sort of management tool that must run on every node (a metric collector, log collector, etc.). You can combine that with a HorizontalPodAutoscaler to make the size of the Deployment react to the load of the system, and in turn combine that with the cluster autoscaler (if you're in a cloud environment) to make the size of the cluster react to the resource requirements of the running pods.
Cluster scale-up and scale-down isn't particularly a problem. If the cluster autoscaler removes a node, it will first move all of the existing pods off of it, so you'll keep the cluster-wide total replica count for your service. Similarly, it's not usually a problem if every node isn't running every service, so long as there are enough replicas of each service running somewhere in the cluster.
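To make the "must run on every node" case concrete, here is what such a management tool looks like as a DaemonSet; a minimal sketch, with name and image chosen purely for illustration:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: log-collector        # hypothetical per-node agent
    spec:
      selector:
        matchLabels:
          app: log-collector
      template:
        metadata:
          labels:
            app: log-collector
        spec:
          containers:
            - name: log-collector
              image: fluent/fluent-bit:2.2   # illustrative image
              # note: no replica count anywhere; the DaemonSet controller
              # runs exactly one copy on every matching node, including
              # nodes added later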
There are two levels (or say layers) of scaling when using deployments:
Let's say a website running on kubernetes has high traffic only on Fridays.
The deployment is scaled up to launch more pods as the traffic increases and scaled down later when traffic subsides. This is service/pod auto scaling.
To accommodate the increase in the pods more nodes are added to the cluster, later when there are less pods some nodes are shutdown and released. This is cluster auto scaling.
Unlike the above case, a DaemonSet has a 1-to-1 mapping to the nodes, and that "N nodes = N pods" kind of scaling is useful only when one pod fits exactly into one node's resources. That, however, is very unlikely in real-world scenarios.
A DaemonSet has the downside that if you need to scale the application, you have to scale the number of nodes to add more pods. Also, if you only need a few pods of the application but have a large cluster, you might end up running a lot of unused pods that block resources for other applications.
A Deployment solves this problem, because two or more pods of the same application can run on one node and the number of pods is decoupled from the number of nodes by default. But this brings another problem: if your cluster is rather small and you have a small number of pods, they might all end up running on a few nodes. There is no good distribution over all available nodes, and if some of those nodes fail you lose the majority of your application's pods.
You can solve this using PodAntiAffinity, so that a pod cannot run on a node where another specified pod is already running. That gives you behavior similar to a DaemonSet, but with far fewer pods and more flexibility regarding scaling and resource usage.
So a use case would be: you don't need one pod per node, but you still want the pods distributed over your nodes. Say you have 50 nodes and an application of which you need 15 pods. Using a Deployment with PodAntiAffinity you can run those 15 pods distributed across 15 different nodes. When you suddenly need 20, you scale up the application (not the nodes) so 20 pods run on 20 different nodes. But you never have 50 pods by default when you only need 15 (or 20).
You could achieve the same with a DaemonSet using nodeSelector or taints/tolerations, but that would be far more complicated and less flexible.
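A minimal sketch of that Deployment-plus-PodAntiAffinity setup (all names are illustrative): the hard requiredDuringScheduling rule means no two replicas can share a node, so 15 replicas land on 15 different nodes of the 50.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      replicas: 15               # scale this, not the nodes
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          affinity:
            podAntiAffinity:
              # hard rule: never co-schedule two myapp pods on one node
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      app: myapp
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: myapp
              image: myapp:1.0   # illustrative image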

Dynamic scaling for statefulset best practices

Background
I have an app running in a Kubernetes cluster that uses sharded MongoDB and Elasticsearch StatefulSets. I set up HorizontalPodAutoscalers for the Deployment components of my app, and everything works well.
Problems
Problems arise when traffic goes up. My server Deployment scales out just fine, but the MongoDB shards and Elasticsearch nodes cannot handle that much traffic, and overall response time suffers.
The simple solution is to configure those StatefulSets with more shards and more replicas. What bugs me is that the traffic spike only lasts 3-4 hours a day, so it is rather wasteful to leave all those pods sitting idle for the rest of the day.
I did some research, and it looks like databases in general are not meant to scale out and in dynamically, as doing so consumes a lot of network and disk I/O just for replication between the members. There is also a potential for data loss and inconsistency while scaling up and down.
Questions
If possible, what is the proper way to handle dynamic scaling in MongoDB, Elasticsearch, and databases in general?
If it is not possible, what can I do to shave some cents off my cloud bill, given that we only need maximum power from the database pods for a short period each day?
You should read about Kubernetes autoscaling - HPA.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can't be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.
With HPA you also have to take care of volume mounting and data latency.
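Since the quoted docs say the HPA can target a StatefulSet directly, a minimal sketch of what that looks like; the StatefulSet name and thresholds are illustrative, not taken from your setup:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: elasticsearch-data
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: elasticsearch-data   # hypothetical StatefulSet name
      minReplicas: 3               # baseline outside the daily spike
      maxReplicas: 9               # ceiling during the 3-4 peak hours
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70

Keep in mind this only changes the replica count; rebalancing shards and data onto the new replicas is still the database's job, which is why the native MongoDB/Elasticsearch options below matter.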
As @Serge mentioned in the comments, I would suggest checking the native cluster-scaling options provided by MongoDB and Elasticsearch themselves.
Take a look at
MongoDB operator documentation
Elasticsearch operator documentation
Elasticsearch future release autoscaling
I am not very familiar with MongoDB and Elasticsearch on Kubernetes, but maybe these tutorials will help you:
https://medium.com/faun/scaling-mongodb-on-kubernetes-32e446c16b82
https://www.youtube.com/watch?v=J7h0F34iBx0
https://kubernetes.io/blog/2017/01/running-mongodb-on-kubernetes-with-statefulsets/
https://sematext.com/blog/elasticsearch-operator-on-kubernetes/#toc-what-is-the-elasticsearch-operator-1
If you use Helm, take a look at the banzaicloud Horizontal Pod Autoscaler operator:
You may not want to, or may not be able to, edit a Helm chart just to add an autoscaling feature. Nearly all charts support custom annotations, so we believe it would be a good idea to be able to set up autoscaling just by adding some simple annotations to your deployment.
We have open sourced a Horizontal Pod Autoscaler operator. This operator watches your Deployment or StatefulSet and automatically creates a HorizontalPodAutoscaler resource, provided you add the correct autoscale annotations.
Hope you find this useful.

Question about 100 pods per node limitation

I'm trying to build a web app where each user gets their own instance of the app, running in its own container. I'm new to kubernetes so I'm probably not understanding something correctly.
I will have a few physical servers to use, which in kubernetes as I understand are called nodes. For each node, there is a limitation of 100 pods. So if I am building the app so that each user gets their own pod, will I be limited to 100 users per physical server? (If I have 10 servers, I can only have 500 users?) I suppose I could run multiple VMs that act as nodes on each physical server but doesn't that defeat the purpose of containerization?
The main issue with having too many pods on a node is that it degrades node performance and makes container management slower (and sometimes unreliable). Each pod is managed individually, so increasing the count takes more time and more resources.
When you create a pod, the runtime needs to keep constant track of it: readiness and liveness probes, monitoring, routing rules, and many other small bits that add up to the load on the node.
Containers also require processor time to run properly. Even though you can allocate fractions of a CPU, adding too many containers per node increases context switching and degrades performance once the pods are consuming their quotas.
Each platform provider also sets its own limits to guarantee quality of service and SLAs. Overloading nodes is a risk as well: a node is a single point of failure, and any fault on a high-density node can have a huge impact on the cluster and its applications.
You should consider either:
Smaller nodes, adding more of them to the cluster, or
An actor model, where each client is one actor and many actors run in a single container. To keep the load balanced around the cluster, you partition the actors across multiple container instances.
Regarding the limits, this thread has a good discussion about the concerns
Because of the hard limit, if you have 10 servers you're limited to 1,000 pods.
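For what it's worth, the per-node cap is not hard-wired into Kubernetes itself: the upstream kubelet default is 110 pods per node, and on self-managed nodes it is just a kubelet setting (managed providers enforce their own caps). A minimal sketch:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    maxPods: 150                 # upstream default is 110; raise with care

Raising it buys density at the cost of exactly the management overhead described in the previous answer.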
You might also want to count system pods against your 1,000 available pods. Usually located in the kube-system namespace, they can include (but are not limited to):
node log exporters (1 per node)
metrics exporters
kube proxy (usually 1 per node)
kubernetes dashboard
DNS (scaling according to the number of nodes)
controllers like certmanager
A pretty good rule of thumb is 80-90 application pods per node, so 10 nodes will be able to handle 800-900 clients, assuming you don't run any other big deployments on those nodes.
If you're using containers in order to gain performance, creating VM nodes will work against your goal. But if you're using containers as a way to deploy coherent environments and scale stateless applications, then using VMs as nodes can make sense.
There are no magic rules and your context will dictate what to do.
Since managing both a virtualization cluster and a Kubernetes cluster may skyrocket your infrastructure complexity, maybe Kubernetes is not the most efficient tool to manage your workload.
You may also want to take a look at Nomad, which does not seem to have this kind of limitation and may provide features that are closer to your needs.

HPA + Cluster Autoscaler + OPA within Federated Kubernetes cluster on GKE

I'm setting up a federated Kubernetes cluster with kubefed on Google Container Engine (GKE) 1.8.3-gke.0.
It seems that, to get HPA and the cluster autoscaler working well, I have to use Open Policy Agent as a Kubernetes admission controller, because of this:
By default, replicas are spread equally in all the underlying clusters. For example: if you have 3 registered clusters and you create a Federated Deployment with spec.replicas = 9, then each Deployment in the 3 clusters will have spec.replicas=3.
But in my case the load changes dynamically in every region, so every cluster should have a dynamic number of pods.
I can't find (or just can't see) examples or manuals covering cases like mine. So, the question is:
What scenario should a policy have, if I have three clusters in my federated context, one for every region of GKE:
eu (1000 rps, nodes labeled with "region=eu")
us (200 rps, nodes labeled with "region=us")
asia (100 rps, nodes labeled with "region=asia")
It should be a single deployment that dynamically spreads pods across those three clusters.
One pod should:
serve 100 rps
request 2 vCPUs + 2Gb RAM
be placed alone on its node (via anti-affinity)
How can I configure OPA to make that schema work, if this is possible?
Thanks in advance for any links to corresponding manuals.
What you are trying to do should be achievable through "Federated Horizontal Pod Autoscalers"; one of their main use cases is exactly your scenario.
Quoting from the Requirements & Design Document of the Federated Pod Autoscaler:
Users can schedule replicas of same application, across the federated clusters, using replicaset (or deployment). Users however further might need to let the replicas be scaled independently in each cluster, depending on the current usage metrics of the replicas; including the CPU, memory and application defined custom metrics.
And from the actual documentation, this passage from the conclusion describes the behaviour:
The use of federated HPA is to ensure workload replicas move to the cluster(s) where they are needed most, or in other words where the load is beyond expected threshold. The federated HPA feature achieves this by manipulating the min and max replicas on the HPAs it creates in the federated clusters. It actually relies on the in-cluster HPA controllers to monitor the metrics and update relevant fields [...] The federated HPA controller, on the other hand, monitors only the cluster-specific HPA object fields and updates the min replica and max replica fields of those in cluster HPA objects, which have replicas matching thresholds.
Therefore, if I haven't misunderstood your needs, there is no reason to use a third-party product like Open Policy Agent or to create policies.