How to calculate Celery load correctly

For example, I have a VPS with 2 shared CPUs, 10,000 receivers, and a task that should not be executed more than 15 times per second. Also, if a request gets a 429 response, it must be retried after 1800 seconds.
for i in receivers_arr:
    send_message.delay(i)

@celery_app.task(ignore_result=True,
                 time_limit=5,
                 autoretry_for=(Exception,),
                 retry_backoff=1800,
                 retry_kwargs={'max_retries': 2},
                 retry_jitter=False,
                 rate_limit=1)
def send_message(reciever_id):
    code = send(reciever_id)
    if code == 429:
        raise Exception
How do I choose the right number of workers and concurrency? Also, am I using the decorator arguments correctly? At the moment I have 3 workers with a concurrency of 4 each. The main goal is to avoid RuntimeError: can't start new thread.
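One thing worth checking in the decorator above is the delay that retry_backoff=1800 actually produces. The following is a minimal sketch, assuming Celery's documented behavior that the delay before a retry is the backoff factor times 2 to the power of the retry count, capped by retry_backoff_max (600 seconds by default). The helper name backoff_delay is hypothetical and only mirrors that formula; it is not part of Celery.

```python
def backoff_delay(factor: int, retries: int, maximum: int = 600) -> int:
    """Delay in seconds before retry number `retries` (0-based), without jitter."""
    return min(maximum, factor * 2 ** retries)


# Settings from the question: retry_backoff=1800, max_retries=2, default cap.
default_cap = [backoff_delay(1800, n) for n in range(2)]

# Same settings with the cap raised (assumed value: retry_backoff_max=7200).
raised_cap = [backoff_delay(1800, n, maximum=7200) for n in range(2)]

print(default_cap)  # [600, 600]
print(raised_cap)   # [1800, 3600]
```

If that reading is right, a fixed 1800-second wait needs either a larger retry_backoff_max or an explicit self.retry(countdown=1800) in a bound task. Note also that rate_limit applies per worker instance, so three workers with rate_limit=1 would start about three tasks per second in total.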

Related

Handling AskTimeoutException in an Akka-based application

I have the following HTTP-based application that routes every request to an Akka Actor which uses a long chain of Akka Actors to process the request.
path("process-request") {
  post {
    val startedAtAsNano = System.nanoTime()
    NonFunctionalMetrics.requestsCounter.inc()
    NonFunctionalMetrics.requestsGauge.inc()
    entity(as[Request]) { request =>
      onComplete(distributor ? [Response](replyTo => Request(request, replyTo))) {
        case Success(response) =>
          NonFunctionalMetrics.requestsGauge.dec()
          NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.OK.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
          complete(response)
        case Failure(ex) =>
          NonFunctionalMetrics.requestsGauge.dec()
          NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.INTERNAL_SERVER_ERROR.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
          logger.warn(s"A general error occurred for request: $request, ex: ${ex.getMessage}")
          complete(InternalServerError, s"A general error occurred: ${ex.getMessage}")
      }
    }
  }
}
As you can see, I'm sending the distributor an ask request for response.
The problem is that on really high RPS, sometimes, the distributor fails with the following exception:
2022-04-16 00:36:26.498 WARN c.d.p.b.http.AkkaHttpServer - A general error occurred for request: Request(None,0,None,Some(EntitiesDataRequest(10606082,0,-1,818052,false))) with ex: Ask timed out on [Actor[akka://MyApp/user/response-aggregator-pool#1374579366]] after [5000 ms]. Message of type [com.dv.phoenix.common.pool.WorkerPool$Request]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
This is a typical, uninformative exception. The normal processing time is about 700 micros, so at 5 seconds the request must be stuck somewhere in the pipeline; processing cannot take that long.
I want to monitor this. I thought about adding the Kamon integration, which provides an Akka actors module with mailbox metrics, etc.
I tried adding the following configuration, but it did not work for me:
https://kamon.io/docs/latest/instrumentation/akka/ask-pattern-timeout-warning/ (didn't show any effect)
Are there other suggestions for understanding the cause of this issue on a high-RPS system?
Thanks!
The Kamon instrumentation is useful for finding how you got to the ask. It can be useful if you have a lot of places where an ask can time out, but otherwise it's not likely to tell you the problem.
This is because an ask timeout is nearly always a symptom of some other problem (the lone exception is if many asks could plausibly be done in a stream (e.g. in a mapAsync or ask stage) but aren't; that doesn't apply in this code). Assuming that the timeouts aren't caused by (e.g.) a database being down so you're getting no reply or a cluster failing (both of these are fairly obvious, thus my assumption), the cause of a timeout (any timeout, generally) is often having too many elements in a queue ("saturation").
But which queue? We'll start with the distributor, which is an actor processing messages one-at-a-time from its mailbox (which is a queue). When you say that the normal processing time is 700 micros, is that measuring the time the distributor spends handling a request (i.e. the time before it can handle the next request)? If so, and the distributor is taking 700 micros, but requests come in every 600 micros, this can happen:
time 0: request 0 comes in, processing starts in distributor (mailbox depth 0)
600 micros: request 1 comes in, queued in distributor's mailbox (mailbox depth 1)
700 micros: request 0 completes (700 micros latency), processing of request 1 begins (mailbox depth 0)
1200 micros: request 2 comes in, queued (mailbox depth 1)
1400 micros: request 1 completes (800 micros latency), processing of request 2 begins (mailbox depth 0)
1800 micros: request 3 comes in, queued (mailbox depth 1)
2100 micros: request 2 completes (900 micros latency), processing of request 3 begins (mailbox depth 0)
2400 micros: request 4 comes in, queued (mailbox depth 1)
2800 micros: request 3 completes (1000 micros latency), processing of request 4 begins (mailbox depth 0)
3000 micros: request 5 comes in, queued (mailbox depth 1)
3500 micros: request 4 completes (1100 micros latency), processing of request 5 begins (mailbox depth 0)
3600 micros: request 6 comes in, queued (mailbox depth 1)
4200 micros: request 7 comes in, queued, request 5 completes (1200 micros latency), processing of request 6 begins (mailbox depth 1)
4800 micros: request 8 comes in, queued (mailbox depth 2)
4900 micros: request 6 completes (1300 micros latency), processing of request 7 begins (mailbox depth 1)
5400 micros: request 9 comes in, queued (mailbox depth 2)
and so on: the latency and depth increase without bound. Eventually, the depth is such that requests spend 5 seconds (or more, even) in the mailbox.
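The timeline above can be reproduced with a few lines of simulation. This is a minimal sketch in Python (not Akka code) of a single actor that handles one message at a time; the function name simulate and its parameters are made up for illustration, and the 600/700 micros figures are the ones used in the example above.

```python
def simulate(arrival_every: int, service_time: int, n_requests: int) -> list[int]:
    """Per-request latency (micros) for one actor processing messages one at a time."""
    latencies = []
    free_at = 0  # time at which the actor finishes its current message
    for i in range(n_requests):
        arrived = i * arrival_every
        started = max(arrived, free_at)  # wait in the mailbox if the actor is busy
        free_at = started + service_time
        latencies.append(free_at - arrived)
    return latencies


latencies = simulate(arrival_every=600, service_time=700, n_requests=60_000)
print(latencies[:7])  # [700, 800, 900, 1000, 1100, 1200, 1300]

# Index of the first request whose latency reaches the 5 second ask timeout.
first_timeout = next(i for i, lat in enumerate(latencies) if lat >= 5_000_000)
print(first_timeout)  # 49993
```

Under these assumed rates, each request waits 100 micros longer than the one before it, so the 5 second timeout is reached after roughly 50,000 requests, which is about 30 seconds of sustained load.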
Kamon has the ability to track the number of messages in the mailbox of an actor (it's recommended to only do this on specific actors). Tracking the mailbox depth of distributor in this case would show it growing without bound to confirm that this is happening.
If the distributor's mailbox is the queue that's getting too deep, first consider how request N can affect request N + 1. The one-at-a-time processing model of an actor is only strictly required when the response to a request can be affected by the request immediately prior to it. If a request only concerns some portion of the overall state of the system, then that request can be handled in parallel with requests that do not concern any part of that portion. If there are distinct portions of the overall state such that no request is ever concerned with 2 or more portions, then responsibility for each portion of state can be offloaded to a specific actor, and the distributor looks at each request only for long enough to determine which actor to forward the request to (note that this will typically not entail the distributor making an ask: it hands off the request, and it is the responsibility of the actor it hands off to (or that actor's designee...) to reply). This is basically what Cluster Sharding does under the hood. It's also noteworthy that doing this will probably increase the latency under low load (because you are doing more work), but it increases peak throughput by up to the number of portions of state.
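As a rough illustration of the partitioning idea, here is a minimal sketch in Python (not Akka code). The names partition_for, requests and mailboxes are hypothetical, and the keys stand in for whatever identifies a portion of state in the real system.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a state key to a partition using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical requests, each touching exactly one portion of state (the key).
requests = [("user-1", "a"), ("user-2", "b"), ("user-1", "c"), ("user-3", "d")]

# One mailbox per partition-owning actor; the distributor only routes.
mailboxes = defaultdict(list)
for key, payload in requests:
    mailboxes[partition_for(key, num_partitions=4)].append((key, payload))

# Requests for the same key stay in order inside one mailbox.
user1_mailbox = mailboxes[partition_for("user-1", 4)]
print([p for k, p in user1_mailbox if k == "user-1"])  # ['a', 'c']
```

The distributor's per-message work shrinks to computing the partition, while ordering is still preserved for requests that touch the same portion of state.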
If that's not a workable way to address the distributor's mailbox being saturated (viz. there's no good way to partition the state), then you can at least limit the time requests spend in the mailbox by including a "respond-by" field in the request message (e.g. for a 5 second ask timeout, you might require a response by 4900 millis after constructing the ask). When the distributor starts processing a message and the respond-by time has passed, it moves onto the next request: doing this effectively means that when the mailbox starts to saturate, the message processing rate increases.
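The respond-by idea can be sketched as follows. This is a simplified Python model rather than Akka code: the Request class, the drain function and the numeric timestamps are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    payload: str
    respond_by: float  # absolute deadline, on the same clock as `now`


def drain(mailbox: deque, now: float, service_time: float) -> tuple[list[str], list[str]]:
    """Process the mailbox in order, skipping requests whose deadline has passed."""
    handled, dropped = [], []
    while mailbox:
        req = mailbox.popleft()
        if now >= req.respond_by:
            dropped.append(req.payload)  # the asker has already timed out; skip the work
            continue
        now += service_time  # simulate doing the real work
        handled.append(req.payload)
    return handled, dropped


mailbox = deque([Request("r0", 1.0), Request("r1", 5.0), Request("r2", 4.5), Request("r3", 9.0)])
handled, dropped = drain(mailbox, now=4.0, service_time=1.0)
print(handled, dropped)  # ['r1', 'r3'] ['r0', 'r2']
```

Expired requests cost only a timestamp comparison, which is why the effective message processing rate goes up once the mailbox starts to saturate.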
Of course, it's possible that your distributor's mailbox isn't the queue that's getting saturated, or that if it is, it's not because the actor is spending too much time processing messages. It's possible that the distributor (or other actors needed for a response) aren't processing messages.
Actors run inside a dispatcher which has the ability to have some number of actors (or Future callbacks or other tasks, each of which can be viewed as equivalent to an actor which is spawned for processing a single message) processing a message at a given time. If there are more actors which have a message in their respective mailboxes than the number that can be processing a message, those actors are in a queue to be scheduled (note that this applies even if you happen to have a dispatcher which will spawn as many threads as it needs to process a message: since there are a limited number of CPU cores, the OS kernel scheduler's queue will take the role of the dispatcher queue). Kamon can track the depth of this queue. In my experience, it's more valuable to detect whether thread starvation (basically whether the time between task submission and when the task starts executing exceeds some threshold) is occurring. Lightbend's package of commercial tooling for use with Akka (disclaimer: I am employed by Lightbend) provides tools for detecting, with minimal overhead, whether starvation is occurring and for providing other diagnostic information.
If thread starvation is being observed, and things like garbage collection pauses, or CPU throttling (e.g. due to running in a container) are ruled out, the primary cause of starvation is actors (or actor-like things) taking too long to process a message either because they are executing blocking I/O or are doing too much in the processing of a single message. If blocking I/O is the culprit, try to move the I/O to actors or futures running in a thread pool with far more threads than the number of CPU cores (some even advocate for an unbounded thread pool for this purpose). If it's a case of doing too much computation in processing a single message, look for spots in the processing where it makes sense to capture the state needed for the remainder of the computation in a message and send that message to yourself (this is basically equivalent to a coroutine yielding).
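The "send a message to yourself" technique can be illustrated with a toy single-threaded mailbox loop. This is a minimal Python sketch, not Akka code; the message shapes ("sum", "ping") and the run function are invented for the example.

```python
from collections import deque


def run(mailbox: deque, chunk: int = 3) -> list[str]:
    """Single-threaded 'actor': a long sum is split into chunks that are re-sent to itself."""
    log = []
    while mailbox:
        msg = mailbox.popleft()
        if msg[0] == "sum":
            _, items, acc = msg
            acc += sum(items[:chunk])  # do a bounded amount of work
            rest = items[chunk:]
            if rest:
                # Capture the remaining state in a message and yield to other messages.
                mailbox.append(("sum", rest, acc))
            else:
                log.append(f"sum={acc}")
        else:
            log.append(f"ping {msg[1]}")
    return log


log = run(deque([("sum", list(range(10)), 0), ("ping", 1), ("ping", 2)]))
print(log)  # ['ping 1', 'ping 2', 'sum=45']
```

The pings are answered before the long computation finishes, even though it was enqueued first, because the computation gives up its turn after every chunk.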

Celery prefetched tasks stuck behind other tasks

I am running into an issue on an ECS cluster including multiple Celery workers when the cluster requires up-scaling.
Some background:
I have a task which is running potentially for a few hours.
Celery workers on an ECS cluster are currently scaled based on queue depth using Flower. Whenever the queue depth is larger than 1, it scales up a worker to potentially receive more tasks.
The broker used is Redis.
I have set the worker_prefetch_multiplier to 1, and each worker's concurrency equals 4.
The problem definition:
Because of these settings, each worker prefetches 4 tasks before the queue depth starts to grow. So with a single worker running, 8 tasks must be invoked before the queue depth reaches 1 on the 9th task: 4 tasks will be in the STARTED state and 4 tasks will be in the RECEIVED state. When the number of worker nodes is scaled up to 2, only the 9th task is sent to the new worker. This means that the 4 tasks in the RECEIVED state are "stuck" behind the 4 tasks in the STARTED state for potentially a few hours, which is undesirable.
Investigated solutions:
When searching for a solution one finds in Celery's documentation (https://docs.celeryproject.org/en/stable/userguide/optimizing.html) that the only way to disable prefetching is to use acks_late=True for the tasks. It indeed solves the problem that no tasks are prefetched, but it also causes other problems like replicating tasks on newly scaled worker nodes, which is DEFINITELY not what I want.
The -O fair worker option is also often suggested as a solution, but it still seems to leave tasks in the RECEIVED state.
Currently I am considering a somewhat complex workaround, so I would be very happy to hear other solutions. The proposed workaround is to set the concurrency to -c 2 (instead of -c 4). This would mean that 2 tasks are started and 2 tasks are prefetched on the first worker node. All other tasks end up in the queue, triggering a scaling event. Once ECS has scaled up to two worker nodes, I would scale the concurrency of the first worker from 2 to 4, releasing the prefetched tasks.
Any ideas/suggestions?
I have found a solution to this problem (in this issue: https://github.com/celery/celery/issues/6500) with the help of @samdoolin. I will provide the full answer here for people who have the same issue.
Solution:
The solution provided by @samdoolin is to monkeypatch the Consumer's can_consume functionality so that it consumes a message only when there are fewer reserved requests than the worker can handle (the worker's concurrency). In my case that would mean that it won't consume requests if there are already 4 requests active. Any further request instead accumulates in the queue, resulting in the expected behavior. Then I can easily scale the number of ECS containers, each holding a single worker, based on the queue depth.
In practice this would look something like the following (thanks again to @samdoolin):
from celery import Celery
from celery.loaders.app import AppLoader


class SingleTaskLoader(AppLoader):

    def on_worker_init(self):
        # called when the worker starts, before logging setup
        super().on_worker_init()

        """
        STEP 1:
        monkey patch kombu.transport.virtual.base.QoS.can_consume()
        to prefer to run a delegate function,
        instead of the builtin implementation.
        """
        import kombu.transport.virtual

        builtin_can_consume = kombu.transport.virtual.QoS.can_consume

        def can_consume(self):
            """
            monkey patch for kombu.transport.virtual.QoS.can_consume
            if self.delegate_can_consume exists, run it instead
            """
            if delegate := getattr(self, 'delegate_can_consume', False):
                return delegate()
            else:
                return builtin_can_consume(self)

        kombu.transport.virtual.QoS.can_consume = can_consume

        """
        STEP 2:
        add a bootstep to the celery Consumer blueprint
        to supply the delegate function above.
        """
        from celery import bootsteps
        from celery.worker import state as worker_state

        class Set_QoS_Delegate(bootsteps.StartStopStep):

            requires = {'celery.worker.consumer.tasks:Tasks'}

            def start(self, c):

                def can_consume():
                    """
                    delegate for QoS.can_consume
                    only fetch a message from the queue if the worker has
                    no other messages
                    """
                    # note: reserved_requests includes active_requests
                    # As written, this admits a message only when nothing is
                    # reserved; to fill every -c slot, compare the count
                    # against the worker's concurrency instead of 0.
                    return len(worker_state.reserved_requests) == 0

                # types...
                # c: celery.worker.consumer.consumer.Consumer
                # c.task_consumer: kombu.messaging.Consumer
                # c.task_consumer.channel: kombu.transport.virtual.Channel
                # c.task_consumer.channel.qos: kombu.transport.virtual.QoS
                c.task_consumer.channel.qos.delegate_can_consume = can_consume

        # add bootstep to Consumer blueprint
        self.app.steps['consumer'].add(Set_QoS_Delegate)


# Create a Celery application as normal with the custom loader and any required **kwargs
celery = Celery(loader=SingleTaskLoader, **kwargs)
Then we start the worker via the following line:
celery -A proj worker -c 4 --prefetch-multiplier -1
Make sure that you don't forget the --prefetch-multiplier -1 option, which disables the built-in fetching of new requests entirely. This makes sure that the can_consume monkeypatch is used.
Now, when the Celery app is up and you request 6 tasks, 4 will be executed as expected and 2 will end up in the queue instead of being prefetched. This is the expected behavior, without actually setting acks_late=True.
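The admission rule described above (take a message only while fewer requests are reserved than the worker's concurrency) can be modeled in a few lines. This is a pure-Python sketch of the idea, not Celery or kombu code; dispatch, queue and reserved are hypothetical names standing in for the broker queue and the worker's reserved requests.

```python
from collections import deque


def dispatch(queue: deque, reserved: set, concurrency: int) -> None:
    """Pull from the broker queue only while the worker has a free slot."""
    while queue and len(reserved) < concurrency:  # the can_consume-style check
        reserved.add(queue.popleft())


queue = deque(f"task-{i}" for i in range(6))
reserved: set = set()

dispatch(queue, reserved, concurrency=4)
print(len(reserved), len(queue))  # 4 2

reserved.discard("task-0")  # one task finishes and frees a slot
dispatch(queue, reserved, concurrency=4)
print(len(reserved), len(queue))  # 4 1
```

With 6 tasks and a concurrency of 4, two tasks stay in the queue where a second worker (or an autoscaler watching queue depth) can see them.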
Then there is one last note I'd like to make. According to Celery's documentation, it should also be possible to pass the path to the SingleTaskLoader when starting the worker in the command line. Like this:
celery -A proj --loader path.to.SingleTaskLoader worker -c 4 --prefetch-multiplier -1
For me this did not work unfortunately. But it can be solved by actually passing it to the constructor.

Multiple concurrent connections with Vertx

I'm trying to build a web application that should be able to handle at least 15000 rps. Some of the optimizations I have made are increasing the worker pool size to 20 and setting the accept backlog to 25000. Since I have set my worker pool size to 20, will this help with the blocking piece of code?
A worker pool size of 20 seems to be the default.
I believe the important question in your case is how long you expect each request to run. On my side, I expect to have thousands of short-lived requests, each with a payload size of about 5-10KB. All of these will be blocking, because of a blocking database driver I use at the moment. I have increased the default worker pool size to 40 and I have explicitly set the number of verticle instances to deploy using the following formula:
final int instances = Math.min(Math.max(Runtime.getRuntime().availableProcessors() / 2, 1), 2);
A test run of 500 simultaneous clients running for 60 seconds, on a vert.x server doing nothing but blocking calls, produced an average of 6 failed requests out of 11089. My test payload in this case was ~28KB.
Of course, from experience I know that running my software in production would often produce results that I have not anticipated. Thus, the important thing in my case is to have good atomicity rules in place, so that I don't get half-baked or corrupted data in the database.

How does JMeter start sending requests to the server

If the configuration is Threads: 100, Ramp-up: 1 and Loop count: 1, how will JMeter start sending requests to the server?
Will requests be sent at 1 req/sec, or will all requests be sent to the server at once?
JMeter will send requests as fast as it can, to wit:
It will start all threads (virtual users) you define in Thread Group within the ramp-up period (in your case - 100 threads in 1 second)
Each thread (virtual user) will start executing the Samplers present in the Thread Group from top to bottom (or according to the Logic Controllers)
When there are no more samplers to execute or loops to iterate the thread will be shut down
When there are no more active threads left - JMeter test will end.
With regards to requests per second - it mostly depends on your application response time, i.e.
if you have 100 virtual users and response time is 1 second - you will get 100 requests/second
if you have 100 virtual users and response time is 2 seconds - you will get 50 requests/second
if you have 100 virtual users and response time is 500 milliseconds - you will get 200 requests/second
etc.
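The relationship in these examples is simply the number of virtual users divided by the response time. A minimal sketch, assuming every virtual user sends its next request as soon as the previous one returns (no timers or think time); the function name throughput is made up for illustration.

```python
def throughput(virtual_users: int, response_time_s: float) -> float:
    """Requests per second when each user loops with no think time."""
    return virtual_users / response_time_s


print(throughput(100, 1.0))  # 100.0
print(throughput(100, 2.0))  # 50.0
print(throughput(100, 0.5))  # 200.0
```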
I would recommend increasing (and decreasing) the load gradually, this way you will be able to correlate increasing load with increasing throughput/response time/number of errors, etc. while releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing, in this case consider using Synchronizing Timer)
A JMeter ramp-up period of 1 means that all 100 threads are started within 1 second.
This is not a recommended setting, as described below:
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
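The per-thread delay in the quoted description is the ramp-up period divided by the number of threads. A small sketch of that arithmetic (the helper start_offsets is hypothetical, and it assumes threads are spaced evenly as the quote describes):

```python
def start_offsets(threads: int, ramp_up_s: float) -> list[float]:
    """Start time of each thread, in seconds after the test begins."""
    step = ramp_up_s / threads
    return [i * step for i in range(threads)]


print(start_offsets(10, 100)[:3])  # [0.0, 10.0, 20.0]
print(start_offsets(30, 120)[1])   # 4.0
print(start_offsets(100, 1)[1])    # 0.01
```

With the settings from the question (100 threads, ramp-up 1), a new thread starts every 10 milliseconds, which is close to starting them all at once.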
See also Can i set ramp up period 0 in JMeter?
Bear in mind that with a low ramp-up and many threads, you may be limited by local resources, so your results may measure client capability rather than server capability.

How to put a rate limit on a celery queue?

I read this in the celery documentation for Task.rate_limit:
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
How do I put a rate limit on a celery queue?
It turns out it can't be done at queue level for multiple workers.
It can be done at queue level for 1 worker, or at queue level for each worker.
So if you set 10 jobs/minute on 5 workers, your workers will process up to 50 jobs per minute collectively.
So to have only 10 jobs per minute overall, you either choose one worker, or choose 5 workers with a limit of 2/minute each.
Update: How to exactly put the limit in settings/configuration:
task_annotations = {'tasks.<task_name>': {'rate_limit': '10/m'}}
or change the same for all tasks:
task_annotations = {'*': {'rate_limit': '10/m'}}
10/m means 10 tasks per minute, /s would mean per second. More details here: Task annotations setting
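Since the limit applies per worker, a global budget has to be divided by the number of workers before it goes into the setting. A minimal sketch of that arithmetic; per_worker_limit is a hypothetical helper, and it assumes a fixed, known number of worker instances.

```python
def per_worker_limit(global_per_minute: int, workers: int) -> str:
    """Split a global per-minute budget evenly across worker instances (rounds down)."""
    return f"{global_per_minute // workers}/m"


limit = per_worker_limit(10, 5)
task_annotations = {"*": {"rate_limit": limit}}
print(task_annotations)  # {'*': {'rate_limit': '2/m'}}
```

Rounding down keeps the combined rate at or below the global budget, at the cost of some unused capacity when the budget does not divide evenly.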
Hey, I was trying to find a way to rate limit a queue, and I found out that Celery can't do that. However, Celery can control the rate per task, see:
http://docs.celeryproject.org/en/latest/userguide/workers.html#rate-limits
So as a workaround, you could set up one task per queue (which makes sense in a lot of situations) and put the limit on the task.
You can set this limit in Flower's worker pane.
There is a field there for entering your limit.
The suggested format is as follows:
The rate limits can be specified in seconds, minutes or hours by appending “/s”, “/m” or “/h” to the value. Tasks will be evenly distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum delay of 600ms between starting two tasks on the same worker instance.
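The 600ms figure follows from spreading the tasks evenly over the time frame. A small sketch of the conversion; min_delay_ms is a hypothetical helper that only handles the “/s”, “/m” and “/h” suffixes mentioned above.

```python
def min_delay_ms(rate: str) -> float:
    """Minimum spacing between task starts for a rate string such as '100/m'."""
    count, unit = rate.split("/")
    seconds = {"s": 1, "m": 60, "h": 3600}[unit]
    return seconds * 1000 / int(count)


print(min_delay_ms("100/m"))  # 600.0
print(min_delay_ms("10/m"))   # 6000.0
```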