Is there a Cadence metric that can help spot overloads for each specific activity worker? - cadence-workflow

My company would like to automatically scale the activity workers and the workflow workers independently, according to the load on each tasklist.
Reading the docs I have found the following metrics for activity workers:
cadence_activity_scheduled_to_start_latency_bucket
cadence_activity_scheduled_to_start_latency_count
cadence_activity_scheduled_to_start_latency_sum
However, these seem to be global metrics for activity workers. Is there a Cadence metric that would allow me to spot overloads for each specific activity worker?
Example:
We have 4 different activity workers: A, B, C and D.
We would like to scale A, B, C or D independently, without impacting the others.

Understand scheduled_to_start_latency
scheduled_to_start_latency measures the time from when a task is scheduled to when a worker starts it. During that interval, the task is transferred from the matching service to an activity worker.
These are the potential hotspots when this latency gets high:
The matching service is too hot to dispatch tasks. In this case, confirm by checking the CPU/memory of the matching nodes.
The tasklist is overloaded because by default it has one partition, which maps to only one matching node: https://cadenceworkflow.io/docs/operation-guide/maintain/#scale-up-a-tasklist-using-scalable-tasklist-feature. In this case, use the tasks-per-second metrics to confirm the task rate of the tasklist.
The activity worker is overloaded.
How to monitor activity worker being overloaded
CPU, memory, thread usage and garbage collection metrics of the activity worker are usually enough to tell whether a worker is overloaded.
You can also use scheduled_to_start_latency, but high latency can have any of the causes listed above. Use the other metrics to rule those causes out.
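To compare workers A, B, C and D separately, one option is to compute a latency percentile per tasklist from the histogram buckets. The sketch below assumes that the cadence_activity_scheduled_to_start_latency_bucket series can be grouped by a tasklist label. That grouping, the sample numbers and the function names are assumptions for illustration, so check what your metrics reporter actually emits. The interpolation mirrors the usual linear estimate used for bucketed histograms.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count); the last
    upper bound is expected to be math.inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Nothing to interpolate against: fall back to the lower edge.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

def overloaded_tasklists(buckets_by_tasklist, q=0.95, threshold_seconds=5.0):
    """Return the tasklists whose q-quantile latency exceeds the threshold."""
    return sorted(
        name
        for name, buckets in buckets_by_tasklist.items()
        if histogram_quantile(q, buckets) > threshold_seconds
    )

# Hypothetical bucket samples, one histogram per tasklist.
samples = {
    "A": [(0.1, 90), (1.0, 98), (10.0, 100), (math.inf, 100)],
    "D": [(0.1, 10), (1.0, 20), (10.0, 100), (math.inf, 100)],
}
print(overloaded_tasklists(samples))
```

A high percentile for one tasklist is only a hint: as noted above, the matching service and the tasklist partitioning need to be ruled out before scaling that worker.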

Related

Is there a way to get the latency for how long it takes for a Cadence activity worker to pick up a new task?

Is there any way to get the latency for how long it takes for a worker to pick up a new task or how long it takes for an activity to be picked up by a worker?
There are two metrics reported on client side for this:
Activity scheduled to start latency
Decision scheduled to start latency
The similar metrics on the server side are "sync match latency" and "async match latency".
Sync match means that a worker is actively polling at the moment the task is scheduled.
So sync match latency can usually be used to measure how much load a task list is under.

Celery prefetched tasks stuck behind other tasks

I am running into an issue on an ECS cluster including multiple Celery workers when the cluster requires up-scaling.
Some background:
I have a task which is running potentially for a few hours.
Celery workers on an ECS cluster are currently scaled based on queue depth using Flower. Whenever the queue depth is larger than 1, it scales up a worker to potentially receive more tasks.
The broker used is Redis.
I have set the worker_prefetch_multiplier to 1, and each worker's concurrency equals 4.
The problem definition:
Because of these settings, each worker prefetches 4 tasks before the queue depth starts to grow. So with a single worker running, 8 tasks must be invoked before the queue depth reaches 1 on the 9th task: 4 tasks will be in the STARTED state and 4 tasks will be in the RECEIVED state. When the number of worker nodes is then scaled up to 2, only the 9th task is sent to the new worker. This means that the 4 tasks in the RECEIVED state are "stuck" behind the 4 tasks in the STARTED state for potentially a few hours, which is undesirable.
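The arithmetic in this scenario can be written down as a small model. This is only a sketch of the behavior described in the question (each worker holds its executing tasks plus a prefetch window of concurrency times the multiplier), not a call into Celery. The function name is made up for illustration.

```python
def visible_queue_depth(submitted, workers, concurrency, prefetch_multiplier=1):
    """Tasks still visible in the broker queue under the model above."""
    executing = workers * concurrency                        # STARTED
    prefetched = workers * concurrency * prefetch_multiplier  # RECEIVED
    return max(0, submitted - executing - prefetched)

# One worker with -c 4 and worker_prefetch_multiplier = 1:
print(visible_queue_depth(8, workers=1, concurrency=4))  # nothing left in the queue
print(visible_queue_depth(9, workers=1, concurrency=4))  # only the 9th task is visible
```

This is why an autoscaler that watches queue depth reacts only from the 9th task onward, even though 4 tasks are already waiting inside the worker.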
Investigated solutions:
When searching for a solution one finds in Celery's documentation (https://docs.celeryproject.org/en/stable/userguide/optimizing.html) that the only way to disable prefetching is to use acks_late=True for the tasks. It indeed solves the problem that no tasks are prefetched, but it also causes other problems like replicating tasks on newly scaled worker nodes, which is DEFINITELY not what I want.
Also, the worker setting -O fair is often suggested as a solution, but it seemingly still leaves tasks in the RECEIVED state.
Currently, I am thinking of a somewhat complex solution to this problem, so I would be very happy to hear other solutions. The proposed solution is to set the concurrency to -c 2 (instead of -c 4). This would mean that 2 tasks are started and 2 tasks are prefetched on the first worker node. All other tasks will end up in the queue, triggering a scaling event. Once ECS has scaled up to two worker nodes, I will scale the concurrency of the first worker from 2 to 4, releasing the prefetched tasks.
Any ideas/suggestions?
I have found a solution for this problem (in this issue: https://github.com/celery/celery/issues/6500) with the help of #samdoolin. I will provide the full answer here for people that have the same issue as me.
Solution:
The solution provided by #samdoolin is to monkeypatch the can_consume method of the Consumer's QoS object so that a message is consumed only when there are fewer reserved requests than the worker can handle (the worker's concurrency). In my case that would mean that it won't consume requests if there are already 4 requests active. Any further request instead accumulates in the queue, resulting in the expected behavior. Then I can easily scale the number of ECS containers, each holding a single worker, based on the queue depth.
In practice this would look something like (thanks again to #samdoolin):
from celery import Celery
from celery.loaders.app import AppLoader


class SingleTaskLoader(AppLoader):

    def on_worker_init(self):
        # called when the worker starts, before logging setup
        super().on_worker_init()

        """
        STEP 1:
        monkey patch kombu.transport.virtual.base.QoS.can_consume()
        to prefer to run a delegate function,
        instead of the builtin implementation.
        """
        import kombu.transport.virtual

        builtin_can_consume = kombu.transport.virtual.QoS.can_consume

        def can_consume(self):
            """
            monkey patch for kombu.transport.virtual.QoS.can_consume
            if self.delegate_can_consume exists, run it instead
            """
            if delegate := getattr(self, 'delegate_can_consume', False):
                return delegate()
            else:
                return builtin_can_consume(self)

        kombu.transport.virtual.QoS.can_consume = can_consume

        """
        STEP 2:
        add a bootstep to the celery Consumer blueprint
        to supply the delegate function above.
        """
        from celery import bootsteps
        from celery.worker import state as worker_state

        class Set_QoS_Delegate(bootsteps.StartStopStep):

            requires = {'celery.worker.consumer.tasks:Tasks'}

            def start(self, c):

                def can_consume():
                    """
                    delegate for QoS.can_consume
                    only fetch a message from the queue if the worker has
                    no other messages
                    """
                    # note: reserved_requests includes active_requests
                    return len(worker_state.reserved_requests) == 0

                # types...
                # c: celery.worker.consumer.consumer.Consumer
                # c.task_consumer: kombu.messaging.Consumer
                # c.task_consumer.channel: kombu.transport.virtual.Channel
                # c.task_consumer.channel.qos: kombu.transport.virtual.QoS
                c.task_consumer.channel.qos.delegate_can_consume = can_consume

        # add bootstep to Consumer blueprint
        self.app.steps['consumer'].add(Set_QoS_Delegate)


# Create a Celery application as normal with the custom loader and any required **kwargs
celery = Celery(loader=SingleTaskLoader, **kwargs)
Then we start the worker via the following line:
celery -A proj worker -c 4 --prefetch-multiplier -1
Make sure that you don't forget the --prefetch-multiplier -1 option, which disables fetching new requests at all. This will make sure that it uses the can_consume monkeypatch.
Now, when the Celery app is up, and you request 6 tasks, 4 will be executed as expected and 2 will end in the queue instead of being prefetched. This is the expected behavior without actually setting acks_late=True.
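The gating rule described in the solution (consume only while fewer requests are reserved than the worker's concurrency) can be illustrated with a toy model. The class below is hypothetical and does not use Celery or kombu. It only shows why 6 submitted tasks end up as 4 reserved and 2 left in the queue.

```python
from collections import deque

class ToyWorker:
    """Toy stand-in for a worker whose consumer is gated by can_consume()."""

    def __init__(self, concurrency):
        self.concurrency = concurrency
        self.reserved = set()  # tasks handed to the worker and not yet finished

    def can_consume(self):
        # Take a message only while there is a free execution slot.
        return len(self.reserved) < self.concurrency

    def poll(self, queue):
        # Pull from the broker queue for as long as the gate allows it.
        while queue and self.can_consume():
            self.reserved.add(queue.popleft())

    def finish(self, task):
        self.reserved.discard(task)

queue = deque(f"task-{i}" for i in range(6))
worker = ToyWorker(concurrency=4)
worker.poll(queue)
print(sorted(worker.reserved), list(queue))
```

Because the two extra tasks stay in the queue, a queue-depth autoscaler sees them immediately and a newly started worker can pick them up.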
Then there is one last note I'd like to make. According to Celery's documentation, it should also be possible to pass the path to the SingleTaskLoader when starting the worker in the command line. Like this:
celery -A proj --loader path.to.SingleTaskLoader worker -c 4 --prefetch-multiplier -1
For me this did not work unfortunately. But it can be solved by actually passing it to the constructor.

How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.
What considerations should be made when selecting values for these timeouts?
As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:
ScheduleToStartTimeout: This configuration specifies the maximum
duration between the time the Activity is scheduled by a workflow and
it’s picked up by an activity worker to start executing it. In other
words, it configures the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by
an activity worker from the time it fetches a task until it reports
the completion of it to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end
timeout duration for an activity from the time it is scheduled by the
workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this
configuration basically specifies the maximum duration the Cadence
server would wait for a heartbeat before assuming the activity worker
has failed.
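To make the difference between the first three timeouts concrete, here is a small sketch that reports which of them would fire first for a given queue wait and run time. It is not a Cadence client API: the dataclass and function are hypothetical, times are plain seconds, and HeartbeatTimeout is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActivityTimeouts:
    schedule_to_start: float
    start_to_close: float
    schedule_to_close: Optional[float] = None

def first_timeout(timeouts, queue_wait, run_time):
    """Return the name of the timeout that fires first, or None if none does."""
    firing = []  # (seconds after scheduling, timeout name)
    if queue_wait > timeouts.schedule_to_start:
        # The task was still in the queue when the limit passed.
        firing.append((timeouts.schedule_to_start, "ScheduleToStartTimeout"))
    elif run_time > timeouts.start_to_close:
        # The worker picked the task up but took too long to complete it.
        firing.append((queue_wait + timeouts.start_to_close, "StartToCloseTimeout"))
    if (timeouts.schedule_to_close is not None
            and queue_wait + run_time > timeouts.schedule_to_close):
        # The end-to-end budget ran out.
        firing.append((timeouts.schedule_to_close, "ScheduleToCloseTimeout"))
    return min(firing)[1] if firing else None

limits = ActivityTimeouts(schedule_to_start=60, start_to_close=30, schedule_to_close=120)
print(first_timeout(limits, queue_wait=10, run_time=5))
print(first_timeout(limits, queue_wait=90, run_time=5))
print(first_timeout(limits, queue_wait=10, run_time=45))
```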
How to select a proper timeout value
Picking the StartToCloseTimeout is fairly straightforward when you know what it does. Essentially, you should make this long enough so that the activity can complete under normal circumstances. Therefore, you should account for everything that can affect the time taken by an activity worker, such as the latency of your down-stream dependencies (i.e. services, networking etc.). On the other hand, you should aim to keep this value as small as is feasible to make your end-to-end system more responsive. If you can’t make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider using a HeartbeatTimeout config and implementing heartbeating in your activity.
ScheduleToStartTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here. Therefore, it’s important to take a moment to pay some extra attention to this configuration.
Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:
Reduced worker pool throughput due to deployments, maintenance or
network-related issues.
Down-stream latency spikes that would increase the time it takes to
complete each activity task, which then reduces the throughput of the
worker pool.
A significant spike in the number of workflow instances that schedule
the activity; especially if one of the upstream services is also an
asynchronous queue/stream processor which can create its own backlog
and suddenly start processing it at a very high volume.
Ideally, no activity should time out while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried, because the retries would add more activity tasks to the queue and make the backlog harder to recover from, or even make it worse. On the other hand, there are many use cases where business requirements really limit the total time the system can take to process an activity. Therefore, it’s usually not a bad idea to aim for a high ScheduleToStartTimeout value as long as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes, or it might be perfectly fine to keep it there for several days before timing out.
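One rough way to sanity-check a queue timeout against a backlog is to estimate how long the backlog takes to drain. The sketch below is a back-of-the-envelope model with assumed steady rates, not something Cadence computes for you. The function name and the numbers are illustrative.

```python
import math

def backlog_drain_seconds(backlog_tasks, worker_throughput, arrival_rate):
    """Seconds until the queue is empty, given steady rates in tasks/second."""
    spare = worker_throughput - arrival_rate
    if spare <= 0:
        # Workers cannot keep up: the backlog never drains.
        return math.inf
    return backlog_tasks / spare

# 6000 queued tasks, workers complete 50 tasks/s, 30 new tasks/s keep arriving.
print(backlog_drain_seconds(6000, worker_throughput=50, arrival_rate=30))
```

If the estimate is longer than the configured queue timeout, tasks at the back of the queue would time out before a worker reaches them, and any retries would add to the same backlog.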

Apache Spark - How does internal job scheduler in spark define what are users and what are pools

I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in Spark. From the documentation here I get that it is some sort of implementation of the Hadoop Fair Scheduler.
I am unable to understand who exactly the users are here (are they the Linux users, Hadoop users, Spark clients?). I am also unable to understand how the pools are defined here. For example, in my Hadoop cluster I have given resource allocations to two different pools (let's call them team 1 and team 2). But in a Spark cluster, won't different pools and the users in them instantiate their own Spark context? Which again brings me to the question of what parameters I pass when I am setting the spark.scheduler.pool property.
I have a basic understanding of how the driver instantiates a Spark context and then splits the work into jobs and tasks. Maybe I am missing the point completely here, but I would really like to understand how Spark's internal scheduler works in the context of actions, tasks and jobs.
I find the official documentation quite thorough and covering all your questions. However, one might find it hard to digest on first reading.
Let us put down some definitions and rough analogues before we delve into details. An application is what creates the SparkContext sc; roughly, it is what you deploy with spark-submit. A job corresponds to an action in Spark's transformation/action terminology, meaning anything like count, collect etc.
There are two main and in some sense separate topics: scheduling across applications and scheduling within an application. The former relates more to resource managers, including the FIFO-only mode of Spark Standalone, and to the concepts of static and dynamic allocation.
The latter, scheduling within a Spark application, is the matter of your question, as I understood from your comment. Let me try to describe what happens there at some level of abstraction.
Suppose, you submitted your application and you have two jobs
sc.textFile("..").count() //job1
sc.textFile("..").collect() //job2
If this code happens to be executed in the same thread, nothing very interesting happens here: job2 and all its tasks get resources only after job1 is done.
Now say you have the following
thread1 { job1 }
thread2 { job2 }
This is getting interesting. By default, within your application the scheduler uses FIFO: it allocates resources to all the tasks of whichever job reaches the scheduler first. Tasks of the other job get resources only when there are spare cores and no more pending tasks from the higher-priority first job.
Now suppose you set spark.scheduler.mode=FAIR for your application. From now on each job has a notion of the pool it belongs to. If you do nothing, the pool label of every job is "default". To set the label for your job you can do the following
sc.setLocalProperty("spark.scheduler.pool", "pool1")
sc.textFile("").count() // job1
sc.setLocalProperty("spark.scheduler.pool", "pool2")
sc.textFile("").collect() // job2
One important note here is that setLocalProperty is effective per thread, and also for all threads spawned from it. What does that mean for us? Well, if you are within the same thread it means nothing, as jobs are executed one after another.
However, once you have the following
thread1 { job1 } // pool1
thread2 { job2 } // pool2
job1 and job2 become unrelated in the sense of resource allocation. In general, by properly configuring each pool in the fair scheduler XML file with minShare > 0, you can be sure that jobs from different pools will have resources to proceed.
However, you can go even further. By default, within each pool jobs are queued up in a FIFO manner, and this situation is basically the same as in the scenario where we had FIFO mode and jobs from different threads. To change that, you need to change the pool in the XML file to have <schedulingMode>FAIR</schedulingMode>.
Given all that, if you just set spark.scheduler.mode=FAIR and let all the jobs fall into the same "default" pool, this is roughly the same as if you used the default spark.scheduler.mode=FIFO and had your jobs launched in different threads. If you still just want a single "default" fair pool, change the config for the "default" pool in the XML file to reflect that.
To leverage the mechanism of pools you need to define the concept of a user, which is the same as setting "spark.scheduler.pool" from a proper thread to a proper value. For example, if your application listens to JMS, then a message processor may set the pool label for each message processing job depending on its content.
In the end, I'm not sure the number of words is less than in the official doc, but hopefully it helps in some way :)
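The difference between FIFO and fair allocation can be illustrated with a toy model of handing out free cores to jobs with pending tasks. This is not Spark's scheduler code: it ignores pool weights and minShare, treats every job as having equal weight, and the function names are made up for illustration.

```python
def allocate_fifo(cores, pending_tasks):
    """Give each job, in submission order, as many cores as it can use."""
    allocation = []
    for pending in pending_tasks:
        taken = min(pending, cores)
        allocation.append(taken)
        cores -= taken
    return allocation

def allocate_fair(cores, pending_tasks):
    """Hand out cores one at a time, cycling over jobs that still have tasks."""
    allocation = [0] * len(pending_tasks)
    while cores > 0:
        progressed = False
        for job, pending in enumerate(pending_tasks):
            if cores > 0 and allocation[job] < pending:
                allocation[job] += 1
                cores -= 1
                progressed = True
        if not progressed:
            break  # every job already has all the cores it can use
    return allocation

# 8 free cores, job1 and job2 each have 20 pending tasks.
print(allocate_fifo(8, [20, 20]))
print(allocate_fair(8, [20, 20]))
```

When the first job is small enough to leave spare cores, both policies give the same result, which matches the remark above that later jobs run once the first job has no more pending tasks.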
By default spark works with FIFO scheduler where jobs are executed in FIFO manner.
But if you have your cluster on YARN, YARN has a pluggable scheduler, which means that in YARN you can use the scheduler of your choice. If you are using YARN distributed by CDH you will have the FAIR scheduler by default, but you can also go for the Capacity scheduler.
If you are using YARN distributed by HDP you will have CAPACITY scheduler by default and you can move to FAIR if you need that.
How Scheduler works with spark?
I'm assuming that you have your spark cluster on YARN.
When you submit a job in Spark, it first hits your resource manager. Your resource manager is responsible for all the scheduling and resource allocation. So it's basically the same as submitting a job in Hadoop.
How scheduler works?
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time (using preemption to kill tasks of jobs that use more than their share). Unlike the default Hadoop scheduler (FIFO), which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities: the priorities are used as weights to determine the fraction of total compute time that each job should get.
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
Spark internally uses a FIFO/FCFS job scheduler. But when you talk about the tasks, it works in a round-robin fashion. It will be clearer if we concentrate on the example below:
Suppose the first job in Spark's own queue doesn't require all the resources of the cluster; then the second job in the queue will immediately start executing as well. Now both jobs are running simultaneously. Each job has a few tasks to be executed in order to complete the whole job. Assume the first job has 10 tasks and the second one has 8. Then those 18 tasks will share the CPU cycles of the whole cluster in a preemptive manner. If you want to drill down further, let's start with executors.
There will be a few executors in the cluster. Assume the number is 6. So, in an ideal condition, each executor will be assigned 3 tasks and those 3 tasks will get the same CPU time on the executors (each a separate JVM).
This is how spark internally schedules the tasks.

Least load scheduler

I'm working on a system that uses several hundred workers in parallel (physical devices evaluating small tasks). Some workers are faster than others, so I was wondering what the easiest way is to load balance tasks across them without a priori knowledge of their speed.
I was thinking about keeping track of the number of tasks a worker is currently working on with a simple counter and then sorting the list to get the worker with the lowest active task count. This way slow workers would get some tasks but not slow down the whole system. The reason I'm asking is that the current round-robin method is causing hold-ups with some really slow workers (100 times slower than others) that keep accumulating tasks and blocking new tasks.
It should be a simple matter of sorting the list according to the current number of active tasks, but since I would be sorting the list several times a second (average work time per task is below 25ms) I fear that this might be a major bottleneck. So is there a simple way of getting the worker with the lowest task count without having to sort over and over again?
EDIT: The tasks are pushed to the workers via an open TCP connection. Since the dependencies between the tasks are rather complex (exclusive resource usage) let's say that all tasks are assigned to start with. As soon as a task returns from the worker all tasks that are no longer blocked are queued, and a new task is pushed to the worker. The work queue will never be empty.
How about this system:
Worker reaches the end of its task queue
Worker requests more tasks from load balancer
Load balancer assigns N tasks (where N is probably more than 1, perhaps 20 - 50 if these tasks are very small).
In this system, since you are assigning new tasks when the workers are actually done, you don't have to guess at how long the remaining tasks will take.
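A small deterministic simulation shows how this pull model adapts to worker speed without tracking it. Everything here is illustrative: the function name is made up, per-task times are integer milliseconds, and a worker is assumed to request its next batch the moment it finishes the previous one.

```python
import heapq

def simulate_pull(task_count, task_millis, batch_size=1):
    """Return how many tasks each worker ends up handling.

    task_millis[w] is the time worker w needs per task, in milliseconds.
    """
    handled = [0] * len(task_millis)
    # Heap of (time at which the worker runs out of work, worker index).
    idle_at = [(0, worker) for worker in range(len(task_millis))]
    heapq.heapify(idle_at)
    remaining = task_count
    while remaining > 0:
        now, worker = heapq.heappop(idle_at)  # next worker to ask for work
        batch = min(batch_size, remaining)
        remaining -= batch
        handled[worker] += batch
        heapq.heappush(idle_at, (now + batch * task_millis[worker], worker))
    return handled

# One fast worker (10 ms per task) and one that is 100 times slower.
print(simulate_pull(101, [10, 1000]))
```

With round-robin the slow worker would receive about half of the tasks. Here it receives only what it can finish, although a larger batch size lets it hold more tasks at a time.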
I think that you need to provide more information about the system:
How do you get a task to a worker? Does the worker request it or does it get pushed?
How do you know if a worker is out of work, or even how much work it is doing?
How are the physical devices modeled?
What you want to do is avoid tracking anything and find a more passive way to distribute the work.
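If per-worker counters are kept anyway, as the question proposes, the least-loaded worker can be found without re-sorting by using a binary heap with lazy invalidation. This is a minimal single-threaded sketch with a hypothetical class name. Stale heap entries are discarded when they reach the top, so the heap can temporarily hold more entries than there are workers.

```python
import heapq

class LeastLoaded:
    """Pick the worker with the fewest active tasks in O(log n) per operation."""

    def __init__(self, workers):
        self.load = {worker: 0 for worker in workers}
        self.heap = [(0, worker) for worker in workers]
        heapq.heapify(self.heap)

    def acquire(self):
        """Return the least-loaded worker and count one more task for it."""
        while True:
            load, worker = heapq.heappop(self.heap)
            if load == self.load[worker]:
                break  # entry is up to date; otherwise it is stale, skip it
        self.load[worker] += 1
        heapq.heappush(self.heap, (self.load[worker], worker))
        return worker

    def release(self, worker):
        """Record that the worker finished one task."""
        self.load[worker] -= 1
        heapq.heappush(self.heap, (self.load[worker], worker))

balancer = LeastLoaded(["a", "b", "c"])
first_round = [balancer.acquire() for _ in range(3)]
print(first_round)
```

Ties are broken by worker name here. A fast worker calls release more often, so its count drops sooner and it is picked again ahead of slow workers.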