Cadence configuration for worker threads and workflow - cadence-workflow

What should the ratio between workers and workflows be, and how should the threads be managed so that there is no outage between the workers and the number of workflows? If I start a larger number of workflows, the following error is thrown:
Not enough threads to execute workflows. If this message appears consistently either WorkerOptions.maxConcurrentWorklfowExecutionSize should be decreased or
WorkerOptions.maxWorkflowThreads increased.
Does a workflow in a blocked state remain active in memory?
Does a workflow in the await state continuously check its condition? Will a larger number of workflows in the await state keep the worker occupied?
In the example below the thread is waiting for a signal, the number of workflows is scaled to a million per day, and timeToCloseWorkflow = 2 days. The average time to trigger a signal is 1 day after the respective workflow is started.
public class TestWorkflowImpl implements TestWorkflow {

    private static final Logger logger = LoggerFactory.getLogger(TestWorkflow.class);

    private int counter = 0;

    private final CounterPrintActivity cpa = Workflow.newActivityStub(CounterPrintActivity.class);

    @Override
    @WorkflowMethod
    public String startWorkflow() {
        // Blocks until signals have incremented the counter to at least 1000.
        Workflow.await(() -> counter >= 1000);
        return "Complete";
    }

    @Override
    public int getCurrentStatus() {
        return counter;
    }

    @Override
    public void setCount(int setNum) {
        logger.info("In signal");
        counter = counter + setNum;
    }
}
--

What should the ratio between workers and workflows be, and how should the threads be managed so that there is no outage between the workers and the number of workflows? If I start a larger number of workflows, the following error is thrown
There is no such ratio as blocked workflows don't consume worker memory at all (after they are pushed out of cache). So it is possible to have billions of blocked workflows and a single worker if those workflows don't make any progress.
Not enough threads to execute workflows. If this message appears consistently either WorkerOptions.maxConcurrentWorklfowExecutionSize should be decreased or WorkerOptions.maxWorkflowThreads increased.
maxWorkflowThreads defines how many threads all currently executing and cached workflows can use.
maxConcurrentWorklfowExecutionSize defines how many workflow tasks can execute in parallel.
The "Not enough threads to execute workflows" exception indicates that there are not enough threads to execute currently running workflow tasks. For example, if each workflow uses two threads and maxConcurrentWorklfowExecutionSize is 100 then maxWorkflowThreads should be at least 200. With such setup 0 workflows will be cached as all the threads would be consumed by the currently executing workflow tasks. So in general it is better to keep maxWorkflowThreads much higher than maxConcurrentWorklfowExecutionSize to support caching.
Does a workflow in a blocked state remain active in memory?
It remains cached until another workflow needs to make progress and kicks the cached workflow out. After that, the blocked workflow is loaded into worker memory when it receives some new event like timer, signal, or activity completion.
Does a workflow in the await state continuously check its condition? Will a larger number of workflows in the await state keep the worker occupied?
The condition is checked only when some new event is processed for the workflow. When nothing is happening, the check is not executed.
In the example below the thread is waiting for a signal, the number of workflows is scaled to a million per day, and timeToCloseWorkflow = 2 days. The average time to trigger a signal is 1 day after the respective workflow is started.
This scenario should work fine assuming that workers can keep up with the workflow task processing rate.

Related

Does the workflow worker in uber-cadence have control over the number of coroutines?

If the workflow executes for a long time (for example, the workflow executes sleep), will a large number of coroutines be generated?
A Cadence or Temporal workflow only needs a worker to generate the next steps to execute. When it is blocked waiting for an external event like a timer, it doesn't consume any worker resources. So a single worker can process a practically unlimited number of workflows, given that it can keep up with their execution rate.
As an optimization, workflows are cached on a worker. But any of them can be kicked out of the cache at any time without affecting their correctness.
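As a small illustration, a workflow that sleeps for a month holds no worker threads while it is blocked; only the short decision tasks before and after the timer need a worker. The LongSleepWorkflow interface below is a hypothetical placeholder, not part of the Cadence API.

import com.uber.cadence.workflow.Workflow;
import java.time.Duration;

public class LongSleepWorkflowImpl implements LongSleepWorkflow { // hypothetical interface
    @Override
    public void run() {
        // While this timer is pending the workflow can be evicted from the
        // worker cache and consumes no threads or memory on the worker.
        Workflow.sleep(Duration.ofDays(30));
        // ... continue with the next steps once the timer fires ...
    }
}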

Does reducing capacity of a resourcePool at a certain time/condition make the unit immediately stop and leave the model?

If I have a condition where a resource pool reduces capacity from 2 to 1 at a certain time of the model OR when the unit interacts with a certain number of different agents, will the unit that is being removed from the model stop what it's doing and leave? Or will it finish all of its queued tasks? I would like it to finish all of its queued tasks.
My code for the condition is as follows where Surgeons is the resourcePool and seizedAgents is a collection inside the Surgeon agent type:
if ( unit.seizedAgents.stream().distinct().count() >= 17 ) {
    Surgeons.set_capacity(1);
}
If the capacity is dynamically reduced by calling set_capacity() and the number of currently seized units exceeds the new capacity, the extra units will only be disposed of after they are released; the rest are disposed of immediately.
Thus, units busy with a task will be disposed of only after completing their current task.
Check the help for more details.

Execute all high priority cadence workflows before any low priority workflows

The documentation at https://cadenceworkflow.io/docs/03_concepts/02_activities#activity-task-routing-through-task-lists mentions that multiple priorities are supported by having one task list per priority and a worker pool per priority. Under that implementation, there may still be low priority workflows that get executed before high priority workflows.
Is it possible to implement a priority system such that no workflow routed to the low priority worker pool gets executed while workflows routed to the high priority workers are still in progress?
In most cases priorities are useful not for workflows, which are mostly blocked waiting for external events, but for activities.
If your rate of execution is relatively low, you can have a separate "priority queue" workflow that receives signals with requests to execute a certain activity and maintains the priority queue of the requests in its memory. It then executes activities, reading them from that queue. Upon an activity completion, a reply signal would be sent to the workflow that requested the execution, as sketched below.
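A rough sketch of such a "priority queue" workflow follows. The ExecutionRequest type, the PriorityQueueWorkflow and PriorityRequester interfaces, and the WorkItemActivity stub are hypothetical placeholders, not part of the Cadence API; a real implementation would also call Workflow.continueAsNew periodically to keep the history bounded.

import com.uber.cadence.workflow.Workflow;
import java.util.Comparator;
import java.util.PriorityQueue;

public class PriorityQueueWorkflowImpl implements PriorityQueueWorkflow { // hypothetical interface

    // Pending requests ordered by priority, kept entirely in workflow memory.
    private final PriorityQueue<ExecutionRequest> queue =
            new PriorityQueue<>(Comparator.comparingInt(ExecutionRequest::getPriority));

    private final WorkItemActivity activities =
            Workflow.newActivityStub(WorkItemActivity.class); // hypothetical activity interface

    @Override
    public void run() {
        while (true) { // a real version would continue-as-new after N iterations
            // Blocks without consuming worker resources until a request arrives.
            Workflow.await(() -> !queue.isEmpty());
            ExecutionRequest request = queue.poll();
            activities.execute(request);
            // Reply to the workflow that asked for this execution.
            PriorityRequester requester = Workflow.newExternalWorkflowStub(
                    PriorityRequester.class, request.getRequesterWorkflowId());
            requester.reportCompleted(request.getId());
        }
    }

    // Signal method declared on the hypothetical PriorityQueueWorkflow interface.
    @Override
    public void submit(ExecutionRequest request) {
        queue.add(request);
    }
}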

Polling for external state transitions in Cadence workflows

I have a Cadence workflow where I need to poll an external AWS API until a particular resource transitions, which might take some amount of time. I assume I should make each individual 'checkStatus' request an Activity, and have the workflow perform the sleep/check loop. However, that means that I may have an unbounded number of activity calls in my workflow history. Is that worrisome? Is there a better way to accomplish this?
It depends on how frequently you want to poll.
For infrequent polls (every minute or slower) use the server-side retry. Specify a RetryPolicy (or RetryOptions for Java) when invoking the activity. In the RetryPolicy specify an exponential coefficient of 1 and an initial interval equal to the poll frequency. Then fail the activity when the polled resource is not ready, and the server will retry it up to the specified retry policy expiration interval.
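For example, the activity options for a once-a-minute server-side retry poll might look roughly like this with the Java client; the package and builder method names are assumptions and may differ slightly between client versions.

import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.common.RetryOptions;
import java.time.Duration;

public class ServerSideRetryPolling {
    // The checkStatus activity simply throws while the resource is not ready;
    // the Cadence server then schedules the next attempt per these options.
    static ActivityOptions checkStatusOptions() {
        return new ActivityOptions.Builder()
                .setScheduleToCloseTimeout(Duration.ofDays(2))
                .setRetryOptions(new RetryOptions.Builder()
                        .setInitialInterval(Duration.ofMinutes(1)) // poll frequency
                        .setBackoffCoefficient(1.0)                // keep the interval constant
                        .setExpiration(Duration.ofDays(2))         // give up after two days
                        .build())
                .build();
    }
}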
For very frequent polls of every few seconds or faster, the solution is to implement the polling inside the activity implementation as a loop that polls and then sleeps for the poll interval. To ensure that the polling activity is restarted in a timely manner in case of worker failure/restart, the activity has to heartbeat on every iteration. Use an appropriate RetryPolicy for the restarts of such a failed activity.
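A sketch of such a polling activity; PollActivity and checkStatus are hypothetical placeholders for your own interface and AWS call, and the activity's options should include a heartbeat timeout plus a RetryPolicy so it is restarted after a worker failure.

import com.uber.cadence.activity.Activity;
import java.time.Duration;

public class PollActivityImpl implements PollActivity { // hypothetical interface

    @Override
    public void pollUntilReady(String resourceId) throws InterruptedException {
        while (!checkStatus(resourceId)) {
            // Heartbeat every iteration so the server notices a dead worker
            // quickly and can restart the activity according to its RetryPolicy.
            Activity.heartbeat(resourceId);
            Thread.sleep(Duration.ofSeconds(2).toMillis());
        }
    }

    private boolean checkStatus(String resourceId) {
        // Call the external AWS API here; return true once the resource
        // has reached the desired state.
        return false;
    }
}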
In the rare case when the polling requires periodic execution of a sequence of activities, or the activity arguments should change between retries, a child workflow can be used. The trick is that a parent is not aware of a child calling continue-as-new. It only gets notified when the child completes or fails. So if the child executes the sequence of activities in a loop and calls continue-as-new periodically, the parent is not affected until the child completes.
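A sketch of this child-workflow variant; the PollingChildWorkflow and PollingActivities names are hypothetical placeholders.

import com.uber.cadence.workflow.Workflow;
import java.time.Duration;

public class PollingChildWorkflowImpl implements PollingChildWorkflow { // hypothetical interface

    private final PollingActivities activities =
            Workflow.newActivityStub(PollingActivities.class); // hypothetical activities

    @Override
    public void poll(String resourceId) {
        for (int i = 0; i < 100; i++) {
            // Any sequence of activities can run on each polling iteration,
            // and the arguments can change between iterations.
            String status = activities.checkStatus(resourceId);
            if ("READY".equals(status)) {
                return; // the parent is only notified of this completion
            }
            Workflow.sleep(Duration.ofSeconds(30));
        }
        // Reset the history; the parent is not aware of this call.
        Workflow.continueAsNew(resourceId);
    }
}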

Azure Function and queue

I have a function:
public async static Task Run([QueueTrigger("efs-api-call-last-datetime", Connection = "StorageConnectionString")]DateTime queueItem,
[Queue("efs-api-call-last-datetime", Connection = "StorageConnectionString")]CloudQueue inputQueue,
TraceWriter log)
{
Then I have a long process for handling the message from the queue. The problem is that the message is re-added to the queue after 30 seconds, while I am still processing it. I don't need the message to be re-added and processed twice.
I would like to have code like:
try
{
    // long operation
}
catch (Exception ex)
{
    // something went wrong; re-add this message in 1 minute
    await inputQueue.AddMessageAsync(
        new CloudQueueMessage(JsonConvert.SerializeObject(queueItem)),
        timeToLive: null,
        initialVisibilityDelay: TimeSpan.FromMinutes(1),
        options: null,
        operationContext: null);
}
and prevent it from being re-added automatically. Is there any way to do this?
There are a couple of things here.
1) When there are multiple queue messages waiting, the queue trigger retrieves a batch of messages and invokes function instances concurrently to process them. By default, the batch size is 16, but this is configurable in host.json. You can set the batch size to 1 if you want to minimize parallel execution. The Microsoft documentation explains this.
2) As it is a long-running process, it seems your messages are not completed in time and the function might time out, so the messages become visible again. You should try to break your function down into smaller functions. Then you can use a Durable Function to chain the work you have to do.
Yes, you can dequeue the same message twice.
Reasons:
1. Worker A dequeues Message B and the visibility timeout expires. Message B becomes visible again and Worker C dequeues Message B, invalidating Worker A's pop receipt. Worker A finishes the work, goes to delete Message B, and an error is thrown. This is the most common case.
2. The lock on the original message that triggers the first Azure Function to execute is likely expiring. This causes the Queue to assume that processing the message failed, and it then uses that message to trigger the Function to execute again.
3. In certain conditions (very frequent queue polling) you can get the same message twice on a GetMessage call. This is a type of race condition that, while rare, does occur. Workers A and B are polling very quickly, hit the queue simultaneously, and both get the same message. This used to be much more common (SDK 1.0 time frame) under high polling scenarios, but it has become much rarer in later storage updates (can't recall seeing this recently).
Cases 1 and 3 only happen when you have more than one worker.
Workaround:
Install version 1.0.11015.0 of azure-webjobs-sdk (visible on the 'Settings' page of the Functions portal). For more details, you could refer to fixing queue visibility renewals.