Polling for external state transitions in Cadence workflows - cadence-workflow

I have a Cadence workflow where I need to poll an external AWS API until a particular resource transitions, which might take some amount of time. I assume I should make each individual 'checkStatus' request an Activity, and have the workflow perform the sleep/check loop. However, that means that I may have an unbounded number of activity calls in my workflow history. Is that worrisome? Is there a better way to accomplish this?

It depends on how frequently you want to poll.
For infrequent polls (every minute or slower), use server-side retry. Specify a RetryPolicy (RetryOptions in Java) when invoking the activity. In the RetryPolicy, set the backoff coefficient to 1 and the initial interval to the poll frequency. Then fail the activity whenever the polled resource is not ready, and the server will keep retrying it until the retry policy's expiration interval is reached.
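With the Java client this might look roughly like the following. This is a sketch: ResourcePollingWorkflow, PollingActivities and checkStatus are hypothetical names standing in for your own interfaces, and the durations are placeholders.

import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.common.RetryOptions;
import com.uber.cadence.workflow.Workflow;
import java.time.Duration;

public class ResourcePollingWorkflowImpl implements ResourcePollingWorkflow {

  @Override
  public void waitForTransition(String resourceId) {
    ActivityOptions options = new ActivityOptions.Builder()
        .setScheduleToStartTimeout(Duration.ofMinutes(5))
        .setStartToCloseTimeout(Duration.ofSeconds(30))    // one poll attempt
        .setRetryOptions(new RetryOptions.Builder()
            .setBackoffCoefficient(1.0)                    // coefficient of 1 = constant interval
            .setInitialInterval(Duration.ofMinutes(1))     // the poll frequency
            .setExpirationInterval(Duration.ofHours(2))    // stop retrying after this long
            .build())
        .build();

    PollingActivities activities = Workflow.newActivityStub(PollingActivities.class, options);

    // checkStatus throws while the resource is still transitioning;
    // the Cadence server retries it according to the RetryOptions above.
    activities.checkStatus(resourceId);
  }
}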
For very frequent polls (every few seconds or faster), the solution is to implement the polling inside the activity itself, as a loop that polls and then sleeps for the poll interval. To ensure that the polling activity is restarted in a timely manner after a worker failure or restart, the activity has to heartbeat on every iteration. Use an appropriate RetryPolicy so that such a failed activity is restarted.
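A rough sketch of such a heartbeating polling activity with the Java client; PollingActivities is a hypothetical activity interface and isResourceReady stands in for your AWS status call:

import com.uber.cadence.activity.Activity;

public class PollingActivitiesImpl implements PollingActivities {

  @Override
  public void pollUntilReady(String resourceId) {
    while (!isResourceReady(resourceId)) {
      // Heartbeat on every iteration so the server notices a dead worker quickly
      // (via HeartbeatTimeout) and reschedules the activity per its RetryPolicy.
      Activity.heartbeat(resourceId);
      try {
        Thread.sleep(5_000);   // the poll interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw Activity.wrap(e);
      }
    }
  }

  private boolean isResourceReady(String resourceId) {
    // call the external AWS API here
    return false;
  }
}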
In the rare case when the polling requires periodic execution of a sequence of activities, or when the activity arguments must change between retries, a child workflow can be used. The trick is that a parent is not aware of a child calling continue-as-new; it only gets notified when the child completes or fails. So if the child executes the sequence of activities in a loop and calls continue-as-new periodically, the parent is not affected until the child completes.
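A sketch of such a child workflow with the Java client. The interface names, the particular activity sequence, and the iteration limit are all assumptions, and the ActivityOptions are abbreviated:

import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Workflow;
import java.time.Duration;

public class PollingChildWorkflowImpl implements PollingChildWorkflow {

  private final ResourceActivities activities = Workflow.newActivityStub(
      ResourceActivities.class,
      new ActivityOptions.Builder()
          .setScheduleToStartTimeout(Duration.ofMinutes(5))
          .setStartToCloseTimeout(Duration.ofSeconds(30))
          .build());

  @Override
  public void pollUntilReady(String resourceId) {
    for (int i = 0; i < 100; i++) {                  // keep one run's history bounded
      activities.refreshCredentials(resourceId);     // example of a multi-activity sequence
      if (activities.checkStatus(resourceId)) {
        return;                                      // only now does the parent get notified
      }
      Workflow.sleep(Duration.ofMinutes(1));
    }
    // Start over with a fresh history; the parent does not observe this.
    PollingChildWorkflow next = Workflow.newContinueAsNewStub(PollingChildWorkflow.class);
    next.pollUntilReady(resourceId);
  }
}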

Related

Does the workflow worker in uber-cadence control the number of coroutines?

If the workflow executes for a long time (for example, the workflow executes sleep), will a large number of coroutines be generated?
A Cadence or Temporal workflow only needs a worker to generate the next steps to execute. When it is blocked waiting for an external event such as a timer, it doesn't consume any worker resources. So a single worker can process a practically unlimited number of workflows, provided it can keep up with their execution rate.
As an optimization, workflows are cached on a worker, but any of them can be evicted from the cache at any time without affecting correctness.

Resume Cadence Workflow based on signal without blocking the thread

We want to build a workflow that contains the steps below, in this order:
1. Execute some synchronous activities.
2. Trigger an external operation via a Kafka event.
3. Listen to the Kafka events for the result of the operation.
4. Execute some other activities based on the result.
Kafka may contain events not related to this workflow, so we need a separate workflow to filter the events for that particular workflow.
Using Cadence, I'm planning to split it into two workflows:
Workflow1 : 1 -> 2 -> wait for signal -> 4
Workflow2 : 3 -> Call workflow1.signal
Is it possible to wait for a signal in workflow1 without actually blocking the thread, so that the thread can process another workflow in the meantime?
I think there is some misunderstanding of how Temporal/Cadence works. There is no requirement to avoid blocking a thread for other workflows to be able to make progress; a worker instance will have no problem dealing with such a situation.
So I would recommend blocking the thread in workflow1 to wait for the signal, as it is the simplest way to satisfy your business requirements.
As a side note, I don't understand why you need a second workflow. There is no need to have a workflow to filter Kafka events. You can do it directly in a Kafka consumer that signals the first workflow.
I have some experience writing Kafka/Kinesis consumers (not working with Cadence yet, but I plan to do so soon). My feeling is that you only need one consumer thread blocked and waiting for new events from the Kafka stream, and this consumer can live anywhere as long as it can talk to your Cadence cluster to send a signal to a workflow. For each Kafka message (after filtering out unrelated ones), if it can be designed to contain all the information the consumer needs to decide which workflow to signal, it will be very simple. If you have no control over what is in the message (it sounds like you have an existing stream), it is a little trickier: your consumer may need to look up which workflow to signal based on some other identifier in the message.
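Putting both answers together, a rough sketch with the Cadence Java client. The interface, method, and field names are assumptions, the activity calls are elided, and the two workflow types would live in separate files:

import com.uber.cadence.client.WorkflowClient;
import com.uber.cadence.workflow.SignalMethod;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

interface OrderWorkflow {
  @WorkflowMethod
  void execute(String orderId);

  @SignalMethod
  void reportResult(String result);   // step 3 delivers the Kafka result here
}

class OrderWorkflowImpl implements OrderWorkflow {
  private String result;

  @Override
  public void execute(String orderId) {
    // steps 1 and 2: run the synchronous activities and publish the Kafka event (elided)
    Workflow.await(() -> result != null);   // parked by the framework; no worker thread is pinned
    // step 4: run the follow-up activities using 'result' (elided)
  }

  @Override
  public void reportResult(String result) {
    this.result = result;
  }
}

// In the Kafka consumer (plain code, no second workflow needed): after filtering
// out unrelated events, signal the workflow identified by something in the message.
class ResultConsumer {
  void onMessage(WorkflowClient client, String workflowId, String result) {
    OrderWorkflow workflow = client.newWorkflowStub(OrderWorkflow.class, workflowId);
    workflow.reportResult(result);   // sends the signal to the running workflow
  }
}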

How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.
What considerations should be made when selecting values for these timeouts?
As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:
ScheduleToStartTimeout: This configuration specifies the maximum duration between the time the activity is scheduled by a workflow and the time it is picked up by an activity worker to start executing it. In other words, it bounds the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by an activity worker from the time it fetches a task until it reports its completion to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end timeout for an activity, from the time it is scheduled by the workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this configuration specifies the maximum duration the Cadence server will wait for a heartbeat before assuming the activity worker has failed.
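For reference, this is roughly where each of these lands in the Java client's ActivityOptions when creating an activity stub inside a workflow; the values below are placeholders, not recommendations:

import com.uber.cadence.activity.ActivityOptions;
import java.time.Duration;

ActivityOptions options = new ActivityOptions.Builder()
    .setScheduleToStartTimeout(Duration.ofMinutes(10)) // max time in the task queue
    .setStartToCloseTimeout(Duration.ofMinutes(2))     // max time for one execution attempt
    .setScheduleToCloseTimeout(Duration.ofMinutes(15)) // end to end: queue time plus execution
    .setHeartbeatTimeout(Duration.ofSeconds(30))       // only relevant for heartbeating activities
    .build();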
How to select a proper timeout value
Picking the StartToCloseTimeout is fairly straightforward once you know what it does. Essentially, you should make it long enough that the activity can complete under normal circumstances. Therefore, you should account for everything that can affect the time taken by an activity worker: the latency of your downstream services, networking, and so on. On the other hand, you should aim to keep this value as small as feasible to make your end-to-end system more responsive. If you can't make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider using a HeartbeatTimeout and implementing heartbeating in your activity.
ScheduleToStartTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here. Therefore, it's important to take a moment to pay some extra attention to this configuration.
Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:
Reduced worker pool throughput due to deployments, maintenance or network-related issues.
Downstream latency spikes that increase the time it takes to complete each activity task, which in turn reduces the throughput of the worker pool.
A significant spike in the number of workflow instances that schedule the activity; especially if one of the upstream services is also an asynchronous queue/stream processor which can build up its own backlog and suddenly start processing it at a very high volume.
Ideally, no activity should time out while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried, because the retries would add more activity tasks to the queue and make it even harder to recover from the backlog. On the other hand, there are many use cases where business requirements genuinely limit the total time the system can take to process an activity. Therefore, it's usually not a bad idea to aim for a high ScheduleToStartTimeout value as long as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes, or it might be perfectly fine to keep it there for several days before timing out.

Processing Groups of Results with Vertx - How to coordinate?

I have a job processing system where each job contains thousands of individual tasks that require different strategies to complete. The individual tasks make up the whole job. If all tasks complete, the job is marked as successfully completed and other steps are taken; if any task fails, the job must be marked as failed and other steps are taken; if the job times out, it must likewise be marked as failed and other steps are taken.
Once all of the results for a job have been received, the next job can be fetched. The next job shouldn't be fetched while a job is currently being processed.
Here is what the flow looks like:
The Job Polling Verticle publishes a job to the event bus, and the Job Processing Verticle publishes each task to the event bus. When the job strategy completes, it publishes the task result to the event bus.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless, the Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it wouldn't know how many results to expect.
The only way I can think of to do this would be to have a global stateful object, but I don't think that is good design.
Additionally, I need to know when a job has timed out. That is, it has run longer than it should, and I need to consider it failed, log it, and move on.
I could do this with the global state, but again I don't think that's the right solution.
Does this verticle pattern make sense for what I'm trying to do?
First, let me try to address your questions. Then I'll try to explain what problems this design has.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless, the Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it wouldn't know how many results to expect.
The solution could be a reference-counting verticle. Each worker should emit a start message on the event bus with the jobId when it starts, and an end message with the jobId when it completes. Even if you have fan-out (the cases where you don't know how many workers there are), the counting verticle will know. In your diagram, the "Job Post Processing Verticle" is a good candidate for this. It can maintain a counter, and only when the counter reaches zero should it start the next job. That also helps you avoid actually sharing a memory reference.
Additionally, I need to know when a job has timed out. That is, it has run longer than it should, and I need to consider it failed, log it, and move on.
In the same verticle you can start a timer every time you get a new start message. If you get the corresponding end message, cancel the timer. If the timer fires first, cancel the current job and start again.
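A minimal sketch of such a counting/timeout verticle; the event bus addresses, message payloads, and the 10-minute timeout are assumptions, not part of the original design:

import io.vertx.core.AbstractVerticle;

public class JobCoordinatorVerticle extends AbstractVerticle {

  private int outstandingTasks = 0;
  private long timeoutTimerId = -1;

  @Override
  public void start() {
    // A verticle runs on a single event-loop thread, so the counter needs no locking.
    vertx.eventBus().<String>consumer("job.task.start", msg -> {
      outstandingTasks++;
      resetJobTimeout();
    });

    vertx.eventBus().<String>consumer("job.task.end", msg -> {
      outstandingTasks--;
      if (outstandingTasks == 0) {
        cancelJobTimeout();
        vertx.eventBus().send("job.fetch.next", "go");   // all results are in, fetch the next job
      }
    });
  }

  private void resetJobTimeout() {
    cancelJobTimeout();
    timeoutTimerId = vertx.setTimer(10 * 60 * 1000L, id -> {   // pick a limit that fits your jobs
      // The job ran too long: mark it failed, log it, and move on.
      vertx.eventBus().send("job.failed", "timeout");
      outstandingTasks = 0;
    });
  }

  private void cancelJobTimeout() {
    if (timeoutTimerId != -1) {
      vertx.cancelTimer(timeoutTimerId);
      timeoutTimerId = -1;
    }
  }
}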
Now, this solution will work, but the design has two main flaws. One is the fact that you seem to maintain all your flow in memory. If your application crashes, all progress is lost, and it's not clear how you record that progress. Maybe polling a Jobs table in a DB would actually be better, since your job execution is sequential anyway.
The second is that all those timeouts and reference counting are a homemade implementation of structured concurrency. Maybe you should take a look at something like Kotlin coroutines, as they will handle many of these problems for you.

How to set Akka actors run only for specific time period?

I have a big task, which I break down into smaller tasks and analyse them. I have a basic model:
Master, worker and listener.
The master creates the tasks and gives them to worker actors. Once a worker actor completes a task, it asks the master for another one. Once all tasks are completed, they inform the listener. They usually take less than 2 minutes to complete 1000 tasks.
Now, sometimes some tasks take more time than others. I want to set a timer for each task, and if a task takes too long, the worker's task should be aborted by the master and the task resubmitted later as a new one. How do I implement this? I can calculate the time taken by a worker task, but how does the master actor keep tabs on the time taken by all worker actors in real time?
One way of handling this would be for each worker, on receipt of a task, to set a receive timeout before changing state to process the task, e.g.:
context.setReceiveTimeout(5.minutes) // for the '5.minutes' notation: import scala.concurrent.duration._
If the ReceiveTimeout message is received, the worker can abort the task (or take whatever other action you deem appropriate, e.g. kill itself, or send a notification message back to the master). Don't forget to cancel the timeout (set the duration to Duration.Undefined) once the task is completed or the like.