What exactly is a Cadence decision task? - cadence-workflow

Activity tasks are pretty easy to understand since it's executing an activity...but what is a decision task? Does the worker run through the workflow from beginning (using records of completed activities) until it hits the next "meaningful" thing it needs to do while making a "decision" on what needs to be done next?

My Opinions
Ideally users don't need to understand it!
However, the decision (workflow) task is a technical detail that leaks through the Cadence/Temporal API.
Unfortunately, you won't be able to use Cadence/Temporal well if you don't fully understand it.
Fortunately, using iWF will keep you away from this leakage. iWF provides a nice abstraction on top of Cadence/Temporal but keeps the same power.
TL;DR
Decision is short for workflow decision.
A decision is a movement from one state to another in a workflow state machine. Essentially, your workflow code defines a state machine. This state machine must be a deterministic state machine for replay, so workflow code must be deterministic.
A decision task is a task for a worker to execute workflow code and generate decisions.
NOTE: in Temporal, a decision is called a "command", and the workflow decision task is called a "workflow task", which generates the commands.
Example
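Let's say we have this workflow code:
public String sampleWorkflowMethod(...) {
    // Schedule activityA and block until its result is available.
    var result = activityStubs.activityA(...);
    if (result.startsWith("x")) {
        // Timer branch: block until the timer fires.
        Workflow.sleep(...);
    } else {
        result = activityStubs.activityB(...);
    }
    return result;
}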
From Cadence/Temporal SDK's point of view, the code is a state machine.
Assume an execution in which the result of activityA is xyz, so the execution goes to the sleep branch.
Then the workflow execution flow is like this graph.
Workflow code defines the state machine, and it's static.
Workflow execution decides how to move from one state to another at run time, based on the input, the results, and the code logic.
A decision is an abstraction internal to Cadence. During the workflow execution, when it changes from one state to another, the decision is the result of that movement.
The abstraction is basically to define what needs to be done when execution moves from one state to another --- schedule activity, timer or childWF etc.
The decision needs to be deterministic: with the same input and results, workflow code must make the same decision, so the choice between scheduling activityA and activityB always comes out the same.
Timeline in the example
What happens during the above workflow execution:
Cadence service schedules the very first decision task, which is dispatched to a workflow worker.
The worker executes the first decision task and returns the decision to schedule activityA to Cadence service. The workflow then waits.
As a result of scheduling activityA, Cadence service generates an activity task and dispatches it to an activity worker.
The activity worker executes the activity and returns the result xyz to Cadence service.
On receiving the activity result, Cadence service schedules the second decision task and dispatches it to a workflow worker.
The workflow worker executes the second decision task and responds to Cadence service with the decision to schedule a timer.
On receiving the decision task response, Cadence service schedules a timer.
When the timer fires, Cadence service schedules the third decision task and dispatches it to a workflow worker again.
The workflow worker executes the third decision task and responds with the decision to complete the workflow execution successfully with the result xyz.
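To make the three decision tasks above concrete, here is a minimal sketch in plain Java. It does not use the Cadence SDK: the class DecisionReplaySketch, the method decide, the map keys, and the decision strings are all invented for illustration, and the real SDK drives your workflow code from history events rather than from a map. What it tries to show is that on every decision task the same deterministic logic runs from the top against what has been recorded so far, and in this example yields one new decision.

```java
import java.util.Map;

// Hypothetical stand-in for the SDK; none of these names are Cadence/Temporal API.
class DecisionReplaySketch {

    /**
     * Runs the sample workflow logic from the beginning against the results
     * recorded so far and returns the new decision for this decision task.
     * The keys "activityA", "timer" and "activityB" are invented names.
     */
    static String decide(Map<String, String> recorded) {
        String result = recorded.get("activityA");
        if (result == null) {
            // Nothing recorded yet: the workflow blocks on activityA.
            return "ScheduleActivity(activityA)";
        }
        if (result.startsWith("x")) {
            if (!recorded.containsKey("timer")) {
                // The workflow blocks on the sleep.
                return "StartTimer";
            }
        } else {
            String b = recorded.get("activityB");
            if (b == null) {
                return "ScheduleActivity(activityB)";
            }
            result = b;
        }
        // Nothing left to wait for: the workflow method returns.
        return "CompleteWorkflow(" + result + ")";
    }
}
```

Calling decide with an empty map, then with activityA recorded as xyz, then with the timer added as well, gives the three decisions from the timeline in order. Because the logic is deterministic, calling it again with the same map always gives the same decision.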
Some more facts about decision
A workflow decision orchestrates other entities such as activities, child workflows, and timers.
A decision (workflow) task is how the worker communicates with Cadence service, telling it what to do next: for example, start or cancel some activities, or complete, fail, or continueAsNew a workflow.
There is always at most one outstanding (running or pending) decision task for each workflow execution. A new one cannot start while another has started but not yet finished.
The nature of the decision task leads to some non-determinism issues when writing Cadence workflows. For more details you can refer to the article.
On each decision task, the Cadence client SDK can start from the very beginning and "replay" the code, for example, executing activityA. However, replay mode won't generate the decision to schedule activityA again, because the client knows that activityA has already been scheduled.
However, a worker doesn't have to run the code from the very beginning. The Cadence SDK is smart enough to keep the state in memory and wake up later to continue from the previous state. This is called the "Workflow Sticky Cache", because a workflow is sticky to a worker host for a period.
History events of the example:
1. WorkflowStarted
2. DecisionTaskScheduled
3. DecisionTaskStarted
4. DecisionTaskCompleted
5. ActivityTaskScheduled <this schedules activityA>
6. ActivityTaskStarted
7. ActivityTaskCompleted <this records the results of activityA>
8. DecisionTaskScheduled
9. DecisionTaskStarted
10. DecisionTaskCompleted
11. TimerStarted <this schedules the timer>
12. TimerFired
13. DecisionTaskScheduled
14. DecisionTaskStarted
15. DecisionTaskCompleted
16. WorkflowCompleted

TL;DR: When a new external event is received, a workflow task is responsible for determining which commands to execute next.
Temporal/Cadence workflows are executed by an external worker. So the only way to learn which next steps a workflow has to take is to ask it every time new information is available. The only way to dispatch such a request to a worker is to put a workflow task into a task queue. The workflow worker picks it up, gets the workflow out of its cache, and applies the new events to it. After the new events are applied, the workflow executes, producing a new set of commands. Once the workflow code is blocked and cannot make any forward progress, the workflow task is reported back to the service as completed. The list of commands to execute is included in the completion request.
Does the worker run through the workflow from beginning (using records of completed activities) until it hits the next "meaningful" thing it needs to do while making a "decision" on what needs to be done next?
This depends on whether a worker has the workflow object in its LRU cache. If the workflow is in the cache, no recovery is needed and only new events are included in the workflow task. If the object is not cached, the whole event history is shipped and the worker has to execute the workflow code from the beginning to get it to its current state. All commands produced while replaying past events are duplicates of previously produced commands and are ignored.
The above means that during the lifetime of a workflow, multiple workflow tasks have to be executed. For example, for a workflow that calls two activities in sequence:
a();
b();
The tasks will be executed for every state transition:
-> workflow task at the beginning: command is ScheduleActivity "a"
a();
-> workflow task when "a" completes: command is ScheduleActivity "b"
b();
-> workflow task when "b" completes: command is CompleteWorkflowExecution
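The "duplicates are ignored" part can be sketched in plain Java as well. This is an assumption-laden toy, not the SDK: WorkflowTaskSketch and newCommands are invented names, and the history is reduced to the list of activities that have already completed, in order. On each workflow task the code is run from the beginning, every command it produces is collected, and the ones that earlier workflow tasks already issued are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replay model for the a(); b(); workflow. Not Cadence/Temporal API.
class WorkflowTaskSketch {

    /**
     * completed: activities whose completion is already in the history,
     * assumed to be a prefix of ["a", "b"] in that order.
     * Returns only the commands that are new for this workflow task.
     */
    static List<String> newCommands(List<String> completed) {
        List<String> produced = new ArrayList<>();
        boolean blocked = false;
        for (String activity : List.of("a", "b")) {
            produced.add("ScheduleActivity " + activity);
            if (!completed.contains(activity)) {
                // No result recorded yet: the workflow code blocks here.
                blocked = true;
                break;
            }
        }
        if (!blocked) {
            produced.add("CompleteWorkflowExecution");
        }
        // One command was issued per completed activity by earlier workflow
        // tasks; those are replay duplicates and are dropped.
        int alreadyIssued = completed.size();
        return new ArrayList<>(produced.subList(alreadyIssued, produced.size()));
    }
}
```

With an empty history this returns the command for "a", with "a" completed it returns only the command for "b", and with both completed it returns only CompleteWorkflowExecution, which matches the three workflow tasks listed above.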
In this answer, I used the terminology adopted by the temporal.io fork of Cadence. Here is how the Cadence concepts map to the Temporal ones:
decision task -> workflow task
decision -> command, but it can also mean workflow task in some contexts
task list -> task queue

Related

Will a workflow be executed by multiple workflow workers at the same time?

Under normal circumstances, will a workflow be executed by multiple workflow workers at the same time? Multiple workflow workers can poll decision tasks to execute, so if not, how is that prevented?
No it will not.
There is only one pending workflow decision task at a time. When a workflow worker is working on a decision task, Cadence will not schedule another one until the current one has completed, failed, or timed out.
However, the timeout is enforced by the server. Technically, when a decision task times out, the worker may still be working on it, but its results will not be accepted afterwards.
It depends on many factors. A workflow can be executed by a single worker if it is short. But it will be executed by many workers if it takes long enough to be pushed out of the worker cache or a worker fails/restarts.
But the same workflow is executed exactly once in all these situations.
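The rule from the first answer (one outstanding decision task, and late results rejected after a timeout) can be pictured with a small gate per workflow execution. This is only a sketch of the idea in plain Java; DecisionTaskGateSketch and its methods are invented, and it is not a description of how the Cadence server is actually implemented.

```java
// Hypothetical per-workflow gate; not Cadence server code.
class DecisionTaskGateSketch {
    private long currentToken = 0;      // identifies the latest started task
    private boolean outstanding = false;

    /** Starts a decision task; returns its token, or -1 if one is outstanding. */
    synchronized long tryStart() {
        if (outstanding) {
            return -1;
        }
        outstanding = true;
        return ++currentToken;
    }

    /** Server-side timeout: the task is abandoned even if the worker still runs. */
    synchronized void timeOut(long token) {
        if (outstanding && token == currentToken) {
            outstanding = false;
        }
    }

    /** Accepts a result only from the task that is currently outstanding. */
    synchronized boolean complete(long token) {
        if (!outstanding || token != currentToken) {
            return false;   // stale or unknown task: result is rejected
        }
        outstanding = false;
        return true;
    }
}
```

If a task is started, times out, and a second task is started, a late complete call with the first token returns false while the second token is accepted, so only one worker's result ever advances the workflow.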

Does the workflow worker in uber-cadence control the number of coroutines?

If the workflow executes for a long time (for example, the workflow executes sleep), will a large number of coroutines be generated?
A Cadence or Temporal workflow needs a worker only to generate the next steps to execute. When it is blocked waiting for an external event like a timer, it doesn't consume any worker resources. So a single worker can process a practically unlimited number of workflows, given that it can keep up with their execution rate.
As an optimization, workflows are cached on a worker. But any of them can be kicked out of the cache at any time without affecting their correctness.
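As a rough picture of why eviction is harmless, here is a minimal LRU cache sketch built on java.util.LinkedHashMap. StickyCacheSketch, stateFor, and the "state-of-" strings are invented for illustration; the real worker cache holds live workflow objects and rebuilds them by replaying history, which this toy only counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical worker-side cache; not the SDK's actual implementation.
class StickyCacheSketch {
    private final Map<String, String> cache;
    int replays = 0;   // how many times state had to be rebuilt from history

    StickyCacheSketch(final int capacity) {
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the workflow state, rebuilding it (a "replay") on a cache miss. */
    String stateFor(String workflowId) {
        String state = cache.get(workflowId);
        if (state == null) {
            replays++;                           // stands in for replaying history
            state = "state-of-" + workflowId;
            cache.put(workflowId, state);
        }
        return state;
    }
}
```

A miss costs a replay but returns the same state as a hit would, which is the point: the cache affects how much work the worker does, not the result.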

Processing Groups of Results with Vertx - How to coordinate?

I have a job processing system where each job contains thousands of individual tasks that require different strategies to complete. The individual tasks make up the whole job. If all tasks have been completed, the job is marked as successfully completed and other steps are taken, if any of the tasks fail, the job must be marked as failed and other steps are taken, if the job times out the job must be marked as failed and other steps are taken.
Once all of the results for a job have been received, the next job can be fetched. The next job shouldn't be fetched while a job is currently being processed.
Here is what the flow looks like:
The Job Polling Verticle publishes a job to the event bus, and the Job Processing Verticle publishes each task to the event bus. When the job strategy completes, it publishes the task result to the event bus.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless. The Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it doesn't know how many results it should expect.
The only way I can think to do this would be to have a global stateful object. But I don't think this is good design.
Additionally, I need to know when a Job has timed out. That is, it has run longer than it should, and I need to consider it failed, log it, and move on.
I could do this with the global state, but again I don't think that's the right solution.
Does this verticle pattern make sense for what I'm trying to do?
First, let me try to address your questions. Then I'll try to explain what problems this design has.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless. The Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it doesn't know how many results it should expect.
The solution could be a reference-counting verticle. Each worker should emit a start message on the event bus with the jobId when it starts, and an end message with the jobId when it completes. Even if you have fan-out (those are the cases where you don't know how many workers there are), the counting verticle will know that. In your diagram, "Job Post Processing Verticle" is a good candidate for this. It can maintain a counter, and only when the counter reaches zero should it start the next job. That also helps avoid actually sharing a memory reference.
Additionally, I need to know when a Job has timed out. That is, it has run longer than it should, and I need to consider it failed, log it, and move on.
In the same verticle you can start a timer every time you get a new start message. If you get an end message, cancel the timer. Otherwise, when the timer fires, cancel the current job and start again.
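Here is a minimal sketch of that counting logic in plain Java, without any Vert.x dependency so it stays self-contained. JobCounterSketch and its method names are invented; in a real verticle these methods would probably be called from event bus handlers, and the deadline bookkeeping would likely be replaced by vertx.setTimer and vertx.cancelTimer. It also assumes that every start message for a job arrives before the counter for that job drops to zero, otherwise the job would be reported as done too early.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical state held inside the counting verticle; not Vert.x API.
class JobCounterSketch {
    private final Map<String, Integer> pending = new HashMap<>();
    private final Map<String, Long> deadline = new HashMap<>();
    private final long timeoutMillis;

    JobCounterSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** A worker announced that it started a task of this job. */
    void onStart(String jobId, long nowMillis) {
        pending.merge(jobId, 1, Integer::sum);
        deadline.put(jobId, nowMillis + timeoutMillis);   // (re)start the timer
    }

    /** A worker announced that it finished; true when the whole job is done. */
    boolean onEnd(String jobId) {
        if (!pending.containsKey(jobId)) {
            return false;                                 // unknown or finished job
        }
        int left = pending.merge(jobId, -1, Integer::sum);
        if (left > 0) {
            return false;
        }
        pending.remove(jobId);
        deadline.remove(jobId);                           // cancel the timer
        return true;
    }

    /** True if the job is still tracked and its deadline has passed. */
    boolean isTimedOut(String jobId, long nowMillis) {
        Long d = deadline.get(jobId);
        return d != null && nowMillis > d;
    }
}
```

Since a verticle's handlers normally run on a single event loop thread, this sketch does no locking. Passing the current time in as a parameter is just a choice that keeps the example easy to test.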
Now, this solution will work, but the design has two main flaws. One is the fact that you maintain all your flow in memory, it seems. If your application crashes, all progress is lost, and it's not clear how you record it. Maybe polling Jobs table in DB would actually be better, since your job execution is sequential anyway.
The second point is that all those timeouts and reference counting are a homemade implementation of structured concurrency. Maybe you should take a look at something like Kotlin coroutines for that, as it will handle many of your problems for you.

Setting up a Job Schedule

I currently have a setup that creates a job and then collects some metrics about the tasks in the job. I want to do something similar, but by setting a job schedule instead. In particular, I want to set a job schedule that wakes up at a recurrence interval that I specify and runs the same code that I was running when creating a job. What's the best way to go about doing that?
It seems that there is a CloudJobSchedule that I could use to set up my job schedule, but this only lets me create, say, a job manager task and specify a few properties. How can I run external code on the jobs created by the job schedule?
It could also help to clarify how the CloudJobSchedule works. Specifically, after I commit my job schedule, what happens programmatically? Does the code just move on sequentially and run the rest of the code? In that case, does it make sense to get a reference to the last job created by the job schedule and run code on the job returned?
You'll want to create a CloudJobSchedule. You can specify the recurrence in the Schedule.
If you only need to run a single task per recurrence, your job manager task can simply be the task you need to run. If you need to run multiple tasks per job recurrence, your job manager needs to have logic to submit tasks to Batch and monitor for completion (if necessary).
When you submit a job schedule to Batch, your client side code will continue running. The behavior is no different than if you were submitting a regular job. You can retrieve the last job run via JobScheduleExecutionInformation and the RecentJob property.

I want a queue that pushes tasks to a worker (celeryd) based on an interval setting

I'm working on a project that uses celery and rabbitmq. I want to be able to control the interval at which the queue pushes tasks to the worker (celeryd).
It sounds like you're looking for this documentation on Periodic Tasks.
Essentially, you configure and run celerybeat, which fires off task executions at intervals.
Word of warning:
If it's undesirable to be running your task multiple times concurrently, I'd suggest you follow a task locking recipe. If your workers are busy or offline, you may end up with a backlog of periodic tasks.