Custom Kubernetes HPA algorithm

I am trying to horizontally autoscale a workload not only on custom metrics but also with an algorithm that differs from the one described here.
1/ Is that possible?
2/ If it is not, and assuming I don't mind creating a container that does the autoscaling for me instead of the HPA, what API should I call to do the equivalent of kubectl scale deployments/<name> --replicas=<newDesired>?
Here's the use case:
1/ The workload consumes a single message from the queue, handles it, removes the item when done, and consumes the next message.
2/ When there are more than 0 messages ready, I'd like to scale up to the number of ready messages (or to the max scale, if the number of ready messages exceeds it).
When there are 0 messages being processed, I'd like to scale down to 0.
Getting the "messages ready" / "messages being processed" counts to the metrics server is not an issue.
Getting the HPA to scale by "messages ready" is not an issue either.
But...
The HPA algorithm scales gradually...
When I place 10 items in the queue, it first scales to 4, then to 8, then to 10.
It also scales down gradually, and when it scales down it can terminate a pod that is still processing - thus increasing the "ready" count and causing a scale-up.
Node.js code that I would have run, had I known the API to call (instead of the HPA):
let desiredToSet = 0;
if (!readyMessages && !processingMessages) {
  // if we have nothing in queue and all workers completed their work - we can scale down to minimum
  // we like it better than reducing slowly as this way we are not risking killing a worker that's working
  desiredToSet = config.minDesired;
}
else {
  // messages ready in the queue, increase number of workers up to max allowed
  desiredToSet = Math.max(Math.min(readyMessages + processingMessages, config.maxDesired), currentDeploymentReplicas);
}
// no point in sending a request to change, if nothing changed
if (desiredToSet !== currentDeploymentReplicas) {
  <api to set desiredToSet of deployment to come here>;
}

1) I don't think it's possible. The HPA controller is built into Kubernetes, and I don't think its algorithm can be extended/replaced.
2) Yes, you can create a custom controller that does the job of the HPA with your own algorithm. To scale the Deployment up and down through the Kubernetes API, you manipulate the Scale sub-resource of the Deployment.
Concretely, to scale the Deployment to a new number of replicas, you make the following request:
PUT /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale
With a Scale resource (containing the desired replica count) as a body argument, as described in the API reference.
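For illustration, here is a minimal sketch of that call using the official Kubernetes Python client (my assumption; a Node.js client such as the official @kubernetes/client-node can do the equivalent). It PATCHes the Scale subresource rather than PUTting a full Scale object, and assumes the caller's ServiceAccount is allowed to update deployments/scale:
# minimal sketch, assuming the official Python client (pip install kubernetes)
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() when running outside the cluster
apps = client.AppsV1Api()

def set_replicas(name: str, namespace: str, replicas: int):
    # patch the Scale subresource of the Deployment with the desired replica count,
    # the equivalent of `kubectl scale deployment/<name> --replicas=<replicas>`
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

set_replicas("my-worker", "default", 10)  # "my-worker" is a hypothetical Deployment name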

Celery prefetched tasks stuck behind other tasks

I am running into an issue on an ECS cluster running multiple Celery workers when the cluster requires up-scaling.
Some background:
I have a task that potentially runs for a few hours.
Celery workers on an ECS cluster are currently scaled based on queue depth using Flower. Whenever the queue depth is larger than 1, it scales up a worker to potentially receive more tasks.
The broker used is Redis.
I have set the worker_prefetch_multiplier to 1, and each worker's concurrency equals 4.
The problem definition:
Because of these settings, each worker prefetches 4 tasks before the queue depth starts filling. So, say we have a single worker running: it takes 8 invoked tasks before the queue depth rises to 1 on the 9th task. 4 tasks will be in the STARTED state and 4 tasks will be in the RECEIVED state. When scaling the number of worker nodes up to 2, only the 9th task will be sent to the new worker. However, this means that the 4 tasks in the RECEIVED state are "stuck" behind the 4 tasks in the STARTED state for potentially a few hours, which is undesirable.
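To make that arithmetic explicit, here is a tiny sketch, assuming the effective prefetch count is concurrency × worker_prefetch_multiplier:
concurrency = 4
worker_prefetch_multiplier = 1

# one worker executes `concurrency` tasks (STARTED) and additionally reserves
# `concurrency * worker_prefetch_multiplier` tasks (RECEIVED)
absorbed_by_one_worker = concurrency + concurrency * worker_prefetch_multiplier  # 8

# the next task is the first one left visible in the queue,
# i.e. the one that finally raises the queue depth used for scaling
first_task_left_in_queue = absorbed_by_one_worker + 1  # 9
print(absorbed_by_one_worker, first_task_left_in_queue)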
Investigated solutions:
When searching for a solution, one finds in Celery's documentation (https://docs.celeryproject.org/en/stable/userguide/optimizing.html) that the only way to disable prefetching is to use acks_late=True for the tasks. It indeed ensures that no tasks are prefetched, but it also causes other problems, like replicating tasks on newly scaled worker nodes, which is DEFINITELY not what I want.
Also, the setting -O fair on the worker is often considered to be a solution, but seemingly it still creates tasks in the RECEIVED state.
Currently, I am thinking of a somewhat complex solution to this problem, so I would be very happy to hear other solutions. The currently proposed solution is to set the concurrency to -c 2 (instead of -c 4). This would mean that 2 tasks are prefetched on the first worker node and 2 tasks are started. All other tasks will end up in the queue, requiring a scaling event. Once ECS has scaled up to two worker nodes, I will scale the concurrency of the first worker from 2 to 4, releasing the prefetched tasks (see the sketch below).
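A rough sketch of how that last step could be triggered without restarting the worker, assuming Celery's remote-control pool_grow command works over the Redis broker (the broker URL and worker name below are hypothetical):
# rough sketch (not the accepted solution further down): grow the first worker's
# pool once ECS has added the second worker node
from celery import Celery

app = Celery(broker="redis://redis:6379/0")  # hypothetical broker URL

def release_prefetched(first_worker="celery@worker-1", extra_processes=2):
    # pool_grow adds child processes to the targeted worker's pool,
    # effectively raising its concurrency from 2 to 4
    app.control.pool_grow(extra_processes, destination=[first_worker])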
Any ideas/suggestions?
I have found a solution for this problem (in this post: https://github.com/celery/celery/issues/6500) with the help of @samdoolin. I will provide the full answer here for people who have the same issue as me.
Solution:
The solution provided by @samdoolin is to monkeypatch the can_consume functionality of the Consumer so that a message is consumed only when there are fewer reserved requests than the worker can handle (the worker's concurrency). In my case that means it won't consume requests if there are already 4 requests active. Any further request instead accumulates in the queue, resulting in the expected behavior. Then I can easily scale the number of ECS containers holding a single worker based on the queue depth.
In practice this would look something like (thanks again to @samdoolin):
from celery import Celery
from celery.loaders.app import AppLoader

class SingleTaskLoader(AppLoader):

    def on_worker_init(self):
        # called when the worker starts, before logging setup
        super().on_worker_init()

        """
        STEP 1:
        monkey patch kombu.transport.virtual.base.QoS.can_consume()
        to prefer to run a delegate function,
        instead of the builtin implementation.
        """
        import kombu.transport.virtual

        builtin_can_consume = kombu.transport.virtual.QoS.can_consume

        def can_consume(self):
            """
            monkey patch for kombu.transport.virtual.QoS.can_consume
            if self.delegate_can_consume exists, run it instead
            """
            if delegate := getattr(self, 'delegate_can_consume', False):
                return delegate()
            else:
                return builtin_can_consume(self)

        kombu.transport.virtual.QoS.can_consume = can_consume

        """
        STEP 2:
        add a bootstep to the celery Consumer blueprint
        to supply the delegate function above.
        """
        from celery import bootsteps
        from celery.worker import state as worker_state

        class Set_QoS_Delegate(bootsteps.StartStopStep):

            requires = {'celery.worker.consumer.tasks:Tasks'}

            def start(self, c):

                def can_consume():
                    """
                    delegate for QoS.can_consume
                    only fetch a message from the queue if the worker has
                    no other messages
                    """
                    # note: reserved_requests includes active_requests
                    return len(worker_state.reserved_requests) == 0

                # types...
                # c: celery.worker.consumer.consumer.Consumer
                # c.task_consumer: kombu.messaging.Consumer
                # c.task_consumer.channel: kombu.transport.virtual.Channel
                # c.task_consumer.channel.qos: kombu.transport.virtual.QoS
                c.task_consumer.channel.qos.delegate_can_consume = can_consume

        # add bootstep to Consumer blueprint
        self.app.steps['consumer'].add(Set_QoS_Delegate)

# Create a Celery application as normal with the custom loader and any required **kwargs
celery = Celery(loader=SingleTaskLoader, **kwargs)
Then we start the worker via the following line:
celery -A proj worker -c 4 --prefetch-multiplier -1
Make sure that you don't forget the --prefetch-multiplier -1 option, which disables fetching new requests altogether. This will make sure that it uses the can_consume monkeypatch.
Now, when the Celery app is up and you request 6 tasks, 4 will be executed as expected and 2 will end up in the queue instead of being prefetched. This is the expected behavior without actually setting acks_late=True.
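A quick way to verify this (a sketch, reusing the celery app object created above with SingleTaskLoader) is to ask the workers what they currently hold:
# sketch: with the patch active we expect 4 active tasks and an empty reserved
# list (nothing prefetched) per worker
insp = celery.control.inspect()

for worker, tasks in (insp.active() or {}).items():
    print(worker, "active:", len(tasks))
for worker, tasks in (insp.reserved() or {}).items():
    print(worker, "reserved:", len(tasks))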
There is one last note I'd like to make. According to Celery's documentation, it should also be possible to pass the path to the SingleTaskLoader when starting the worker on the command line, like this:
celery -A proj --loader path.to.SingleTaskLoader worker -c 4 --prefetch-multiplier -1
Unfortunately, this did not work for me. But it can be solved by actually passing the loader to the constructor.

Multiple containers with resource/requests limits in pods on kubernetes return 0

I don't know if this is a bug/issue or question.
Proposal
Use case. Why is this important?
For monitoring multiple containers with resource/requests limits in pods on kubernetes.
Bug Report
What did you do?
I'm writing a query to get a usage percentage, based on the maximum CPU usage and the maximum we have set in the limits (resource and request) of the pod.
We have this problem affecting our query:
1. When we take a pod that has 2 containers with configured resource/requests limits, it is not possible to take the value of the resource/requests limits.
2. We want to show the value for the pod (resource/requests), but it can have multiple replicas.
max_over_time(sum(rate(container_cpu_usage_seconds_total{namespace="alpha",container_name!="POD", container_name!=""}[1m])) [1h:1s]) / on(pod) kube_pod_container_resource_requests_cpu_cores{namespace="alpha"}
Error executing query:found duplicate series for the match group {pod="heapster-65ddcb7b4c-vtl8j"} on the right hand-side of the operation: [{__name__="kube_pod_container_resource_requests_cpu_cores", container="heapster-nanny", instance="kubestate-alpha.internal:80", job="k8s-prod-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"}, {__name__="kube_pod_container_resource_requests_cpu_cores", container="heapster", instance="kubestate-alpha.internal:80", job="k8s-alpha-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"}];many-to-many matching not allowed: matching labels must be unique on one side.
We tried solutions like "Using group_left to calculate label proportions":
sum without (container) (rate(kube_pod_container_resource_requests_cpu_cores{pod="heapster-65ddcb7b4c-vtl8j"}[1m]))
But if the value is set on the containers, the result of the query is 0, so we are not able to calculate the percentage.
kube_pod_container_resource_requests_cpu_cores{pod="heapster-65ddcb7b4c-vtl8j"}
kube_pod_container_resource_requests_cpu_cores{container="heapster", instance="kubestate-alpha.internal:80", job="k8s-alpha-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"} 0.084
kube_pod_container_resource_requests_cpu_cores{container="heapster-nanny", instance="kubestate-alpha.internal:80", job="k8s-alpha-http", namespace="alpha", node="ip-99-990-0-99.sa-east-1.compute.internal", pod="heapster-65ddcb7b4c-vtl8j"} 0.05
Standard output for the kube_pod_container_resource_requests_cpu_cores command
What did you expect to see?
The sum of what is set in the containers in the pod.
What did you see instead? Under which circumstances?
Prometheus UI
Environment
System information:
Linux 4.4.0-1096-aws x86_64
Prometheus version:
v2.15.2

Vert.x unfair verticle redeployment after node crash

I've recently been doing some experiments on the behavior of Vert.x and verticles in HA mode. I observed some weaknesses in how Vert.x dispatches the load to the various nodes.
1. One node in a cluster crashes
Imagine a configuration with a cluster of some Vert.x nodes (say 4 or 5, 10, whatever), each having some hundreds or thousands of verticles. If one node crashes, only one of the remaining nodes will restart all the verticles that had been deployed on the crashed node. Moreover, there is no guarantee that it will be the node with the smallest number of deployed verticles. This is unfair, and in the worst case the same node would get all of the verticles from nodes that have crashed before, probably leading to a domino crash scenario.
2. Adding a node to a heavily loaded cluster
Adding a node to a heavily loaded cluster doesn't help to reduce the load on the other nodes. Existing verticles are not redistributed to the new node, and new verticles are created on the node that invokes vertx.deployVerticle().
While the first point allows, within some limits, high availability, the second point breaks the promise of simple horizontal scalability.
I may very possibly be wrong: I may have misunderstood something, or my configurations may be faulty. This question is about confirming this behavior and getting your advice on how to cope with it, or pointing out my errors. Thanks in advance for your feedback.
This is how I create my vertx object:
VertxOptions opts = new VertxOptions()
        .setHAEnabled(true);

// start vertx in cluster mode
Vertx.clusteredVertx(opts, vx_ar -> {
    if (vx_ar.failed()) {
        ...
    }
    else {
        vertx = vx_ar.result();
        ...
    }
});
and this is how I create my verticles:
DeploymentOptions depOpt = new DeploymentOptions()
        .setInstances(1).setConfig(prm).setHa(true);

// deploy the verticle
vertx
    .deployVerticle("MyVerticle", depOpt, ar -> {
        if (ar.succeeded()) {
            ...
        }
        else {
            ...
        }
    });
EDIT on 12/25/2019: After reading Alexey's comments, I believe I probably wasn't clear.
By "promise of simple horizontal scalability" I didn't mean that redistributing the load upon insertion of a new node is simple. I meant Vert.x's promise to the developer that what he needs to do to have his application scale horizontally would be simple. Scale is the very first argument on the Vert.x home page but, you're right, after re-reading carefully there's nothing about horizontal scaling onto newly added nodes. I believe I was too much influenced by Elixir or Erlang. Maybe Akka provides this on the JVM, but I didn't try.
Regarding the second comment, it's not (only) about the number of requests per second. The load I'm considering here is just the number of verticles that are doing nothing else than waiting for a message. In a further experiment I will make these verticles do some work and I will send an update. For the time being, imagine long-living verticles that represent, in memory, the currently connected user sessions on a backend. The system runs on 3 (or whatever number of) clustered nodes, each hosting a few thousand (or whatever more) sessions/verticles. From this state, I added a new node and waited until it was fully integrated in the cluster. Then I killed one of the first 3 nodes. All verticles are restarted fine, but only on one node which, moreover, is not guaranteed to be the "empty" one. The destination node actually seems to be chosen at random: I did several tests and I have even observed verticles from all killed nodes being restarted on the same node. On a real platform with sufficient load, that would probably lead to a global crash.
I believe that implementing a fair restart of verticles in Vert.x, i.e. distributing the verticles over all remaining nodes based on a given measure of their load (CPU, RAM, # of verticles, ...), would be simpler (not simple) than redistributing the load onto a newly inserted node, as the latter would probably require the capability for a scheduler to "steal" verticles from another one.
Yet, on a production system, not being "protected" by some kind of fair distribution of workload over the cluster may lead to big issues, and as Vert.x is quite mature, I was surprised by the outcome of my experiments, hence my thinking that I was doing something wrong.

FlowFiles stuck in the queue in NiFi Cluster

I am currently running NiFi 1.9.2 in a clustered environment with 3 nodes. Recently what I have noticed is that the flow seems to get stuck. The queue shows that there are items in the queue, but nothing is going to the downstream processor. When I list the items in the queue, I get "The queue has no FlowFiles".
The queue in this case is set to load balance with round robin. If I stop the downstream processor and change the configuration on the queue not to load balance, and then switch it back to round robin again, the queue items get distributed to the other two nodes, and I can see the flow files when I list the items in the queue. However, it only shows items as being on two of the nodes. When I restart the downstream processor, 2/3 of the items get processed, leaving the 1/3 that sits on the node whose queue items I cannot see. This behavior seems to persist even after restarting the cluster service.
If I change the queue not to load balance, then everything seems to get put on a good node and the queue gets emptied. So it looks like there might be something wrong on my first node.
Any suggestions on what to try?
Thanks,
-tj
You should check disk usage. If the usage of the disk NiFi is located on is equal to or higher than the "nifi.content.repository.archive.max.usage.percentage" setting in the nifi.properties file, you may see this kind of strange NiFi behavior. If you are in this situation, you can try deleting old NiFi log files.

How to partition space between n pods in Kubernetes

We are using Kubernetes and we need to do "smart partitioning" of data. We want to split the space from 1 to 1000 between n running Pods,
and each pod should know which part of the space it has to handle (for polling partitioned tasks).
So for example, if we have 1 pod, it will handle the whole space from 1-1000.
When we scale out to 3 pods, each of them will get an equal share.
Pod 1 - will handle 1-333
Pod 2 - 334-667
Pod 3 - 668-1000
Right now the best way we have found to handle this issue is to create a StatefulSet, whose pods poll the number of running pods and their own instance number, and decide which part of the space each needs to handle.
Is there a smarter/built-in way in Kubernetes to partition the space between nodes in this manner?
Service Fabric has this feature built in.
There are NO native tools for scaling at the partition level in K8s yet.
Only custom solutions similar to what you came up with in your original post.
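To make that concrete, here is a minimal Python sketch of the StatefulSet idea from the question. It assumes each pod parses its ordinal from its hostname (StatefulSet pods are named <statefulset-name>-<ordinal>) and that the current replica count is injected as an environment variable; the variable name and the 1-1000 range are illustrative:
import os
import socket

SPACE_START, SPACE_END = 1, 1000

def my_range(total_replicas: int):
    # StatefulSet pods are named <name>-<ordinal>, so the ordinal can be parsed
    # from the hostname; the replica count has to come from somewhere else
    # (an env var you inject, or a query on the StatefulSet's spec.replicas)
    ordinal = int(socket.gethostname().rsplit("-", 1)[-1])
    size = SPACE_END - SPACE_START + 1
    chunk = size // total_replicas
    start = SPACE_START + ordinal * chunk
    # the last pod absorbs the remainder so the whole space stays covered
    end = SPACE_END if ordinal == total_replicas - 1 else start + chunk - 1
    return start, end

# e.g. with 3 replicas: pod-0 -> (1, 333), pod-1 -> (334, 666), pod-2 -> (667, 1000)
print(my_range(int(os.environ.get("TOTAL_REPLICAS", "1"))))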
Here is another custom way of doing this, for your reference, based on this Airbnb tech blog.
Given the list of pods and their names, each pod is able to
deterministically calculate a list of partitions that it should work
on. When we add or remove pods from the ReplicaSet, the pods will
simply pick up the change, and work on the new set of partitions
instead
How they do it is based on their repo. I have summarized the key components here (note: the repo is written in Java).
Get how many pods are running in the k8s namespace, and sort them by pod name (code). Code snippet:
String podName = System.getenv("K8S_POD_NAME");
String namespace = System.getenv("K8S_NAMESPACE");
NamespacedKubernetesClient namespacedClient = kubernetesClient.inNamespace(namespace);
ReplicaSet replicaSet;
// see above code link to know how to get activePods, remove it here because it is too long
int podIndex = activePods.indexOf(podName);
int numPods = activePods.size();
Every time you call the above code, you get a deterministic podIndex and numPods. Then, use this information to calculate the range this pod is responsible for:
List<Integer> partitions = new ArrayList<>();
int split = spaceRange / numPods;
int start = podIndex * split;
int end = (podIndex == numPods - 1) ? spaceRange - 1 : ((podIndex + 1) * split) - 1;
for (int i = start; i <= end; i++) {
    partitions.add(i);
}
Since the number of pods can change at any time, you may need an executorService.scheduleWithFixedDelay to periodically update the list, as here:
executorService.scheduleWithFixedDelay(this::updatePartitions, 0, 30, TimeUnit.SECONDS);
This approach is not the best, since if you set scheduleWithFixedDelay to 30 seconds, any pod change won't be captured for up to 30 seconds. Also, it is possible that for a short period of time two pods are responsible for the same space, and you need to handle this special case in your business logic, as the Airbnb tech blog does.