We are using Kubernetes and we need to do "smart partitioning" of data. We want to split the space from 1 to 1000 between n running pods,
and each pod should know which part of the space is its to handle (for polling partitioned tasks).
So for example, if we have 1 pod, it will handle the whole space, 1-1000.
When we scale out to 3 pods, each of them will get an equal share:
Pod 1 - will handle 1-333
Pod 2 - 334-666
Pod 3 - 667-1000
Right now the best way we have found to handle this is to create a StatefulSet in which each pod polls the number of running pods and its own instance number, and decides which part of the space it needs to handle.
Is there a smarter/built-in way in Kubernetes to partition the space between pods in this manner?
Service Fabric has this feature built in.
There are NO native tools for scaling at the partition level in K8s yet, only custom solutions similar to what you came up with in your original post.
Here is another custom way of doing this, for your reference, based on this tech blog from Airbnb:
Given the list of pods and their names, each pod is able to
deterministically calculate a list of partitions that it should work
on. When we add or remove pods from the ReplicaSet, the pods will
simply pick up the change, and work on the new set of partitions
instead
How they do it is shown in their repo. I summarized the key components here (note: the repo is written in Java).
Get the number of pods running in the K8s namespace, sorted by pod name (code). Code snippet:
String podName = System.getenv("K8S_POD_NAME");
String namespace = System.getenv("K8S_NAMESPACE");
NamespacedKubernetesClient namespacedClient = kubernetesClient.inNamespace(namespace);
// The pod's ReplicaSet is looked up through namespacedClient to build activePods,
// the sorted list of running pod names; see the code link above (elided here for brevity).
int podIndex = activePods.indexOf(podName);
int numPods = activePods.size();
Every time you run the above code, you get a deterministic podIndex and numPods. Then use this information to calculate the range this pod is responsible for:
List<Integer> partitions = new ArrayList<>();
int split = spaceRange / numPods; // spaceRange is the total size of the partition space
int start = podIndex * split;
// the last pod absorbs the remainder so the whole space stays covered
int end = (podIndex == numPods - 1) ? spaceRange - 1 : ((podIndex + 1) * split) - 1;
for (int i = start; i <= end; i++) {
    partitions.add(i);
}
Since the number of pods can change at any time, you may need an executorService.scheduleWithFixedDelay to periodically update the list, as here:
executorService.scheduleWithFixedDelay(this::updatePartitions, 0, 30, TimeUnit.SECONDS);
This approach is not ideal: if you set scheduleWithFixedDelay to 30 seconds, a pod change won't be captured for up to 30 seconds. Also, for a short period of time two pods may be responsible for the same space, and you need to handle this special case in your business logic, as the Airbnb tech blog does.
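Putting the two snippets together, the deterministic assignment can be sketched as a small self-contained class. This is a sketch only: spaceSize, numPods, and podIndex are assumed inputs here, whereas in the real code podIndex and numPods come from the sorted list of active pod names.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Self-contained sketch of the deterministic partition assignment described above.
public class PartitionCalculator {

    // Returns {start, end} (inclusive) of the sub-range owned by podIndex.
    static int[] rangeFor(int spaceSize, int numPods, int podIndex) {
        int split = spaceSize / numPods;
        int start = podIndex * split;
        // The last pod absorbs the remainder so the whole space stays covered.
        int end = (podIndex == numPods - 1) ? spaceSize - 1 : (podIndex + 1) * split - 1;
        return new int[]{start, end};
    }

    // Materializes the owned partitions, as the loop in the answer does.
    static List<Integer> partitionsFor(int spaceSize, int numPods, int podIndex) {
        int[] r = rangeFor(spaceSize, numPods, podIndex);
        return IntStream.rangeClosed(r[0], r[1]).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // With 3 pods over a space of 1000, the last pod (index 2) owns 666..999.
        int[] r = rangeFor(1000, 3, 2);
        System.out.println(r[0] + "-" + r[1]); // prints "666-999"
    }
}
```

Because every pod sorts the same pod list, all pods agree on the assignment without coordinating with each other.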
Related
What is the simplest way to find out the availability of a K8s service over a period of time, let's say 24h? Should I target a pod, or find a way to calculate service reachability?
I'd recommend not approaching it from a binary angle (is it up or down) but from a "how long does it take to serve requests" perspective. In other words, phrase your availability in terms of SLOs. You can get very nice automatically generated SLO-based alert rules from PromTools. One concrete example rule from there, showing the PromQL part:
1 - (
sum(rate(http_request_duration_seconds_bucket{job="prometheus",le="0.10000000000000001",code!~"5.."}[30m]))
/
sum(rate(http_request_duration_seconds_count{job="prometheus"}[30m]))
)
The above computes 1 minus the ratio of requests served without server errors (non-5xx, that is, assumed good responses) in under 100 ms to overall responses over the last 30 minutes (in other words, the error ratio), with http_request_duration_seconds being the histogram capturing the distribution of your service's request durations.
I have a Kubernetes cluster in Amazon EKS with autoscaling set up, so when load increases a new node spins up in the cluster, and it spins down as load decreases. We monitor it with Prometheus and send the desired alerts with Alertmanager.
So help me with a query that will send an alert whenever autoscaling happens in my cluster.
The logic is not so great, but this works for me on a non-EKS self-hosted Kubernetes cluster on AWS EC2s.
(group by (kubernetes_io_hostname, kubernetes_io_role) (container_memory_working_set_bytes ) * 0
The above query fetches the currently up nodes and multiplies them by 0,
or group by (kubernetes_io_hostname, kubernetes_io_role) (delta ( container_memory_working_set_bytes[1m]))) == 1
Here, it adds all nodes that existed in the last 1 minute via the delta() function. The default value of the nodes in the delta() output will be 1, but the existing nodes will be overridden by the value 0 because of the precedence of the or operator. So finally, only the newly provisioned node(s) will have the value 1, and they get filtered by the equality condition. You can also tell whether the new node is a master or a worker from the kubernetes_io_role label.
Full Query:
(group by (kubernetes_io_hostname, kubernetes_io_role) (container_memory_working_set_bytes ) * 0 or group by (kubernetes_io_hostname, kubernetes_io_role) (delta ( container_memory_working_set_bytes[1m]))) == 1
You can reverse this query for node scale-down, although that will collide with cases in which a Kubernetes node shuts down abruptly for reasons other than autoscaling.
I don't know if this is a bug/issue or a question.
Proposal
Use case. Why is this important?
For monitoring multiple containers with resource/requests limits in pods on kubernetes.
Bug Report
What did you do?
I'm writing a query to get a percentage of usage based on the maximum CPU usage and the maximum set in the limits (resources and requests) of the pod.
We have these problems affecting our query:
1. When we take a pod that has 2 containers with configured resource/request limits, it is not possible to take the value of the resource/request limits.
2. It shows the value per pod (resources/requests), but the pod can have multiple replicas.
max_over_time(sum(rate(container_cpu_usage_seconds_total{namespace="alpha",container_name!="POD", container_name!=""}[1m])) [1h:1s]) / on(pod) kube_pod_container_resource_requests_cpu_cores{namespace="alpha"}
Error executing query:found duplicate series for the match group {pod="heapster-65ddcb7b4c-vtl8j"} on the right hand-side of the operation: [{__name__="kube_pod_container_resource_requests_cpu_cores", container="heapster-nanny", instance="kubestate-alpha.internal:80", job="k8s-prod-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"}, {__name__="kube_pod_container_resource_requests_cpu_cores", container="heapster", instance="kubestate-alpha.internal:80", job="k8s-alpha-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"}];many-to-many matching not allowed: matching labels must be unique on one side.
We tried solutions like [Using group_left to calculate label proportions]:
sum without (container) (rate(kube_pod_container_resource_requests_cpu_cores{pod="heapster-65ddcb7b4c-vtl8j"}[1m]))
But since the value is set on the container, the result of the query is 0, so the calculation cannot be done.
kube_pod_container_resource_requests_cpu_cores{pod="heapster-65ddcb7b4c-vtl8j"}
kube_pod_container_resource_requests_cpu_cores{container="heapster", instance="kubestate-alpha.internal:80", job="k8s-alpha-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"} 0.084
kube_pod_container_resource_requests_cpu_cores{container="heapster-nanny", instance="kubestate-alpha.internal:80", job="k8s-alpha-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"} 0.05
Standard output of the kube_pod_container_resource_requests_cpu_cores query
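For reference, one common way to avoid the many-to-many error is to aggregate the per-container gauge to the pod level before dividing; since kube_pod_container_resource_requests_cpu_cores is a gauge, it can be summed directly rather than going through rate(). This is a sketch only; label names such as container_name vs container depend on your kube-state-metrics and cAdvisor versions:

```promql
max_over_time(
  sum by (pod) (
    rate(container_cpu_usage_seconds_total{namespace="alpha", container_name!="POD", container_name!=""}[1m])
  )[1h:1m]
)
/ on(pod)
sum by (pod) (kube_pod_container_resource_requests_cpu_cores{namespace="alpha"})
```

With both sides keyed only by the pod label, the division is one-to-one per pod.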
What did you expect to see?
The sum of what is set in the containers in the pod.
What did you see instead? Under which circumstances?
Prometheus UI
Environment
System information:
Linux 4.4.0-1096-aws x86_64
Prometheus version:
v2.15.2
I am trying to horizontally autoscale a workload not only by custom metrics but also with an algorithm that differs from the one described here.
1/ Is that possible?
2/ If it is not, and assuming I don't mind creating a container that does the autoscaling for me instead of the HPA, what API should I call to do the equivalent of kubectl scale deployments/<name> --replicas=<newDesired>?
Here's the use case:
1/ The workload consumes a single request from a queue, handles it, removes the item when done, and consumes the next message.
2/ When there are more than 0 messages ready, I'd like to scale up to the number of ready messages (or the max scale if that is larger).
When there are 0 messages being processed, I'd like to scale down to 0.
Getting the ready/in-process message counts to the metrics server is not an issue.
Getting the HPA to scale by "messages ready" is not an issue either.
But...
The HPA algorithm scales gradually: when I place 10 items in the queue, it first scales to 4, then to 8, then to 10.
It also scales down gradually, and when it scales down it can terminate a pod that is still processing, thus increasing the "ready" count and causing a scale-up.
Node.js code that I would have run had I known the API to call (instead of the HPA):
let desiredToSet = 0;
if (!readyMessages && !processingMessages) {
  // if we have nothing in queue and all workers completed their work - we can scale down to minimum
  // we like it better than reducing slowly as this way we are not risking killing a worker that's working
  desiredToSet = config.minDesired;
} else {
  // messages ready in the queue, increase number of workers up to max allowed
  desiredToSet = Math.max(Math.min(readyMessages + processingMessages, config.maxDesired), currentDeploymentReplicas);
}
// no point in sending a request to change, if nothing changed
if (desiredToSet !== currentDeploymentReplicas) {
  <api to set desiredToSet of deployment to come here>;
}
1) I don't think it's possible. The HPA controller is built into Kubernetes, and I don't think its algorithm can be extended or replaced.
2) Yes, you can create a custom controller that does the job of the HPA with your own algorithm. To scale the Deployment up and down through the Kubernetes API, you manipulate the Scale sub-resource of the Deployment.
Concretely, to scale the Deployment to a new number of replicas, you make the following request:
PUT /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale
With a Scale resource (containing the desired replica count) as the request body, as described in the API reference.
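For illustration, the body sent with that PUT might look like the following Scale object; the deployment name and namespace here are placeholders:

```json
{
  "apiVersion": "autoscaling/v1",
  "kind": "Scale",
  "metadata": {
    "name": "my-deployment",
    "namespace": "default"
  },
  "spec": {
    "replicas": 5
  }
}
```

The response echoes the Scale object, and the Deployment controller then reconciles the pods toward the new replica count.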
I'm building a solution where we'll have a (Service Fabric) stateless service deployed to K instances. This service is tasked with some workload (like querying), and I want to split the workload between them as evenly as I can. I want this to be dynamic: if I decide to go from K instances to N instances tomorrow, the workload splitting should automatically redistribute the load across the N instances. I don't have any partitions specified for this service.
As an example -
Let's say I'd like to query a database to retrieve a particular chunk of the records. I have 5 nodes, and I want each of these 5 nodes to retrieve a different 1/5th of the set of records. This can be achieved through query logic like (row_id % N == K), where N is the total number of instances and K is the unique instance number.
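The modulo split described here can be sketched as follows. This is a minimal illustration; instanceNumber is assumed to be the stable 0-based index the question is asking how to obtain:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Sketch of the modulo split: instance K of N handles rows where row_id % N == K.
public class ModuloPartitioner {

    static List<Long> rowsFor(List<Long> rowIds, int totalInstances, int instanceNumber) {
        return rowIds.stream()
                .filter(id -> id % totalInstances == instanceNumber)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> ids = LongStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());
        // Instance 0 of 5 handles the ids divisible by 5.
        System.out.println(rowsFor(ids, 5, 0)); // prints "[5, 10]"
    }
}
```

In practice the filter would live in the database query itself; the point is only that each instance's share is fully determined by (N, K).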
I was hoping to leverage FabricRuntime.GetNodeContext().NodeId - but this returns a guid which is not overly useful.
I'm looking for a way to deterministically say "this is instance number M out of N" (I need to be able to number the instances 1..N) so I can set my querying logic accordingly. One of the requirements is that if an instance goes down or crashes and SF automatically restarts it, it should still identify as the same instance ID, so that 2 or more nodes don't query the same set of results.
What is the best way of solving this problem? Is there a solution that involves pure configuration through ApplicationManifest.xml or ServiceManifest.xml?
There is no out of the box solution for your problem, but it can be easily done in many different ways.
The simplest way is using the Queue-Based Load Leveling pattern in conjunction with Competing Consumers pattern.
It consists of creating a queue and adding the work to it; each instance gets one message and processes that work. If an instance goes down and the message is not processed, the message goes back to the queue and another instance picks it up.
This way you don't have to worry about the number of instances running, failures and so on.
Regarding the work being put in the queue, it depends on whether you want to do batch processing or process item by item.
Item by item: you put one message in the queue for each item to be processed. This is a simple way to handle the work, and each instance processes one message at a time, or multiple messages in parallel.
In batches: you put a message that represents a list of items, and each instance processes that batch until completed. This is a bit trickier, because you might have to track the progress of the work so that, after a failure, the next run can continue from where it stopped.
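The competing-consumers shape described above can be illustrated in-process. This is a sketch only: the in-memory BlockingQueue stands in for a real durable queue (such as a Service Bus queue), and processAll and the item counter are made up for the demo:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// In-process illustration of Queue-Based Load Leveling with Competing Consumers:
// work is enqueued once, and whichever instance is free dequeues the next item.
public class CompetingConsumers {

    // Drains the queue with the given number of competing consumers and
    // returns how many items were processed in total.
    static int processAll(BlockingQueue<Integer> queue, int consumers) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                Integer item;
                // Each "instance" competes for the next message; none owns a fixed share.
                while ((item = queue.poll()) != null) {
                    processed.incrementAndGet(); // stand-in for the real work
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) {
            queue.put(i);
        }
        System.out.println(processAll(queue, 3)); // prints "100"
    }
}
```

Note that nothing here depends on the number of consumers: changing the instance count changes only how fast the queue drains, which is exactly why the pattern sidesteps the instance-numbering problem.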
The queue approach is a reactive design: the work needs to be put in the queue to trigger the processing. If you want a proactive approach and need to keep track of which work goes to whom, you might be better off with some other approach, like a leasing mechanism, where each instance acquires a lease that belongs to it until it releases the lease. This is more suitable when you work with partitioned data or another mechanism where you can easily split the load.
Regarding the issue with the ID, an option would be the InstanceId of the replica you are on, which you can reach via StatelessService.Context.InstanceId. It is not a sequential ID, but a random number. It is still better than using the node ID, because you might have multiple partitions on the same node and the IDs would conflict with each other.
If you decide to use named partitions, you could use an order in the partition names instead, so each partition has a sequential name.
Worth mentioning: Service Fabric has a limitation that doesn't allow a service to have multiple replicas on the same node. Because of this limitation you might have to design your services with it in mind; otherwise you won't be able to scale out once the limit is reached. Also, the same thread has some discussion about approaches to processing multiple distributed items that might give you some ideas.