How to set proper timeout values for Cadence activities (local and regular activities, with or without retry)? - cadence-workflow

So there are so many timeout values:
For local activity:
ScheduleToClose timeout
For regular activity without retry:
ScheduleToStart timeout
ScheduleToClose timeout
StartToClose timeout
Heartbeat timeout
And then more values in retryOptions:
ExpirationInterval
InitialInterval
BackoffCoefficient
MaximumInterval
MaximumAttempts
And retryOptions can be applied onto localActivity or regular activity.
How do I use them together with what expectation?

TL;DR
The easiest way of using timeouts:
Regular Activity with retry:
Use StartToClose as timeout of each attempt
Leave ScheduleToStart and ScheduleToClose empty
If StartToClose is large (like 10 minutes), set the Heartbeat timeout to a smaller value like 10s, and call the heartbeat API inside the activity regularly.
Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control backoff.
Use retryOptions.ExpirationInterval as the overall timeout of all attempts.
Leave retryOptions.MaximumAttempts empty. (See the sketch after this list.)
Regular Activity without retry:
Use ScheduleToClose for overall timeout
Leave ScheduleToStart and StartToClose empty
If ScheduleToClose is large (like 10 minutes), set the Heartbeat timeout to a smaller value like 10s, and call the heartbeat API inside the activity regularly.
LocalActivity without retry: Use ScheduleToClose for overall timeout
LocalActivity with retry:
Use ScheduleToClose as timeout of each attempt.
Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control backoff.
Use retryOptions.ExpirationInterval as the overall timeout of all attempts.
Leave retryOptions.MaximumAttempts empty.
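For example, here is a minimal Go sketch of the "Regular Activity with retry" recommendation above, using the Cadence Go client (the durations and the MyActivity function are placeholders; adjust if your SDK version differs):

package sample

import (
    "context"
    "time"

    "go.uber.org/cadence"
    "go.uber.org/cadence/workflow"
)

// MyActivity is a placeholder for your real activity implementation.
func MyActivity(ctx context.Context) error { return nil }

func MyWorkflow(ctx workflow.Context) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Minute, // timeout of each attempt; ScheduleToStart/ScheduleToClose left empty
        HeartbeatTimeout:    10 * time.Second, // small, because StartToClose is large; heartbeat inside the activity
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute,
            ExpirationInterval: time.Hour, // overall timeout of all attempts; MaximumAttempts left empty
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)
    return workflow.ExecuteActivity(ctx, MyActivity).Get(ctx, nil)
}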
More TL;DR
Because activities should be idempotent, every activity should set a retry policy.
Temporal sets an infinite retry policy for every activity by default; Cadence should do the same IMO.
iWF also sets a default infinite retry for State APIs to match Temporal activities.
What and Why
Basics without Retry
Things are easier to understand in a world without retry, because that's where Cadence started.
ScheduleToClose timeout is the overall end-to-end timeout from a workflow's perspective.
ScheduleToStart timeout is the time allowed for an activity worker to pick up and start a scheduled activity. If it is exceeded, a ScheduleToStart timeout error/exception is returned to the workflow.
StartToClose timeout is the time allowed for an activity to run once started. If it is exceeded, a StartToClose timeout error is returned to the workflow.
Requirement and defaults:
Either ScheduleToClose is provided or both of ScheduleToStart and StartToClose are provided.
If only ScheduleToClose is provided, then ScheduleToStart and StartToClose default to it.
If only ScheduleToStart and StartToClose are provided, then ScheduleToClose = ScheduleToStart + StartToClose.
All of them are capped by workflowTimeout. (E.g. if workflowTimeout is 1 hour, setting ScheduleToClose to 2 hours still yields 1 hour: ScheduleToClose = Min(ScheduleToClose, workflowTimeout).)
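As a concrete illustration of these defaults, here is a sketch of the no-retry recommendation from the TL;DR, setting only ScheduleToClose (assuming the same Cadence Go client imports as the earlier sketch; the 5-minute value is arbitrary):

func withNoRetryOptions(ctx workflow.Context) workflow.Context {
    return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        // Overall end-to-end timeout; ScheduleToStart and StartToClose are left
        // empty, so per the rules above they default to the same value.
        ScheduleToCloseTimeout: 5 * time.Minute,
    })
}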
So why do they all exist?
You may notice that ScheduleToClose is only useful when
ScheduleToClose < ScheduleToStart + StartToClose: if ScheduleToClose >= ScheduleToStart + StartToClose, the ScheduleToClose timeout is already enforced by the combination of the other two and becomes meaningless.
So the main use case for setting ScheduleToClose below the sum of the other two is when people want to limit the overall timeout of the activity while giving more room to ScheduleToStart or StartToClose. This is an extremely rare use case.
Similarly, the main reason people want to distinguish ScheduleToStart from StartToClose is that the workflow may need to do some special handling for a ScheduleToStart timeout error. This is also a very rare use case.
Therefore you can see why the TL;DR recommends using only ScheduleToClose and leaving the other two empty: only in rare cases will you need them. If you can't think of the use case, then you don't need it.
LocalActivity doesn't have ScheduleToStart/StartToClose because it's started directly inside the workflow worker, without server-side scheduling involved.
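Accordingly, LocalActivityOptions in the Cadence Go client exposes ScheduleToCloseTimeout (plus an optional RetryPolicy) rather than the full set of timeouts. A hedged sketch, with placeholder values and a placeholder MyLocalActivity function, assuming the same imports as the earlier sketch:

func runLocalActivity(ctx workflow.Context) error {
    lao := workflow.LocalActivityOptions{
        ScheduleToCloseTimeout: 5 * time.Second, // per-attempt timeout once a RetryPolicy is set
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    30 * time.Second,
            ExpirationInterval: 2 * time.Minute, // overall timeout of all attempts
        },
    }
    ctx = workflow.WithLocalActivityOptions(ctx, lao)
    return workflow.ExecuteLocalActivity(ctx, MyLocalActivity).Get(ctx, nil)
}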
Heartbeat timeout
Heartbeating is very important for a long-running activity, to prevent it from getting stuck. Not only bugs can cause an activity to get stuck; a regular deployment, host restart or failure can also cause it, because without heartbeats the Cadence server can't know whether or not the activity is still being worked on. See more details here: Solutions to fix stuck timers / activities in Cadence/SWF/StepFunctions
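For illustration, a minimal sketch of heartbeating from inside an activity using the Go client's activity.RecordHeartbeat (the activity body and the processItem helper are hypothetical):

import (
    "context"

    "go.uber.org/cadence/activity"
)

func LongRunningActivity(ctx context.Context, items []string) error {
    for i, item := range items {
        if err := processItem(item); err != nil { // processItem is a placeholder
            return err
        }
        // Report progress regularly so the server notices a dead worker quickly;
        // the recorded details can also help a retried attempt resume.
        activity.RecordHeartbeat(ctx, i)
    }
    return nil
}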
RetryOptions and Activity with Retry
First of all, RetryOptions here means server-side backoff retry -- the retry is managed automatically by Cadence without interacting with the workflow. Because retry is managed by Cadence, the activity has to be handled specially in Cadence history: the started event cannot be written until the activity is closed. Here is some reference: Why an activity task is scheduled but not started?
In fact, a workflow can also do client-side retry on its own, meaning the workflow manages the retry logic itself. You can write your own retry function, or use a helper function in the SDK, like Workflow.retry in Cadence-java-client. Client-side retry shows all start events immediately, but a single retried activity produces many events in the history, so it's not recommended for performance reasons.
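For comparison, a hedged Go sketch of hand-rolled client-side retry inside workflow code (MyActivity and the backoff values are placeholders; same imports as the earlier sketch). Note how every attempt goes through ExecuteActivity and therefore lands in the workflow history:

func executeWithClientSideRetry(ctx workflow.Context, maxAttempts int) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = workflow.ExecuteActivity(ctx, MyActivity).Get(ctx, nil); err == nil {
            return nil
        }
        // Durable backoff between attempts; each attempt still adds
        // scheduled/started/failed events to the history.
        if sleepErr := workflow.Sleep(ctx, time.Duration(attempt+1)*time.Second); sleepErr != nil {
            return sleepErr
        }
    }
    return err
}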
So what do the options mean:
ExpirationInterval:
It replaces the ScheduleToClose timeout to become the actual overall timeout of the activity for all attempts.
It's also capped by workflowTimeout, like the other three timeout options: ScheduleToClose = Min(ScheduleToClose, workflowTimeout)
The timeout of each attempt is StartToClose, and StartToClose defaults to ScheduleToClose as explained above.
ScheduleToClose will be extended to ExpirationInterval:
ScheduleToClose = Max(ScheduleToClose, ExpirationInterval), and this happens before ScheduleToClose is copied to ScheduleToStart and StartToClose.
InitialInterval: the backoff interval before the first retry
BackoffCoefficient: self-explanatory
MaximumInterval: the maximum backoff interval during retries
MaximumAttempts: the maximum number of attempts. If set together with ExpirationInterval, retry stops when either one of them is exceeded.
Requirements and defaults:
Either MaximumAttempts or ExpirationInterval is required. ExpirationInterval is set to workflowTimeout if not provided.
ExpirationInterval is therefore always present, and in fact it's more useful. Most of the time it's harder to use MaximumAttempts, because it's easily messed up with backoffCoefficient (e.g. when backoffCoefficient > 1, the end-to-end time can become really large if you're not careful). So I would recommend just using ExpirationInterval, unless you really need MaximumAttempts.
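To make the backoffCoefficient pitfall concrete, here is a small back-of-the-envelope helper (plain Go, no Cadence dependency; it mirrors the usual exponential-backoff arithmetic rather than Cadence's exact internal calculation):

// worstCaseWait sums the backoff intervals between attempts.
func worstCaseWait(initial time.Duration, coefficient float64, maxInterval time.Duration, attempts int) time.Duration {
    total, interval := time.Duration(0), initial
    for i := 1; i < attempts; i++ { // no wait after the final attempt
        total += interval
        interval = time.Duration(float64(interval) * coefficient)
        if interval > maxInterval {
            interval = maxInterval
        }
    }
    return total
}

// With InitialInterval = 1s, BackoffCoefficient = 2 and MaximumAttempts = 10, the waits
// alone are 1+2+4+...+256 = 511 seconds, before adding the execution time of each attempt.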

Related

Automatic heart-beat of Cadence workflow/activity

We have registered the activities with auto-heartbeat configuration EnableAutoHeartBeat: true and also configured the activity option config HeartbeatTimeout: 15Min in the activity implementation.
Do we still need to explicitly send heart-beat using activity.heartbeat() or is it automatically taken care by the go-client library?
If it's automatic, then what will happen if the Activity is waiting for an external API response with, say, a >15 min delay?
What will happen during the activity heart-beat if the worker executing the activity crashes or killed?
Will Cadence retry the activities due to heart-beat failures?
No, the SDK will take care of it with this config.
The auto-heartbeat sends a heartbeat at a fixed interval of 80% * HeartbeatTimeout (the timeout being 15 minutes in your case), so the activity won't time out as long as the activity worker is still alive.
That said, you should use a smaller heartbeat timeout; ideally 10~20s is best.
The activity will fail with a "heartbeat timeout" error.
Yes, if you have set a retry policy.
See my other answer about retry policy:
How to set proper timeout values for Cadence activities (local and regular activities, with or without retry)?
Example
Let's say your activity implementation waits on an AWS SDK API call for up to 2 hours (the max API timeout configured) --
You should still use 10-20s for the heartbeat timeout, and 2 hours for the activity StartToClose timeout.
Heartbeat timeout is for detecting that the host is no longer alive, so that the activity can be restarted as early as possible.
Imagine this case:
The API call takes 2 hours, and the activity worker gets restarted during those 2 hours.
If the HB timeout is 15 minutes, then Cadence will retry this activity after 15 minutes.
If HB timeout is 10s, then Cadence can retry it after 10s, because it will get HB timeout within 10 seconds.
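Putting the example into code, a hedged Go sketch (durations follow the example above; CallAWSAPIActivity is a placeholder, and with the auto-heartbeat registration from the question you would not need to call the heartbeat API yourself; same imports as the earlier sketch):

func callLongRunningAPI(ctx workflow.Context) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 2 * time.Hour,    // the AWS API call can take up to 2 hours
        HeartbeatTimeout:    15 * time.Second, // small, so a dead worker is detected within seconds
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute,
            ExpirationInterval: 6 * time.Hour, // overall budget across retries
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)
    return workflow.ExecuteActivity(ctx, CallAWSAPIActivity).Get(ctx, nil)
}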

How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.
What considerations should be made when selecting values for these timeouts?
As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:
ScheduleToStartTimeout: This configuration specifies the maximum duration between the time the Activity is scheduled by a workflow and it's picked up by an activity worker to start executing it. In other words, it configures the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by an activity worker from the time it fetches a task until it reports the completion of it to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end timeout duration for an activity from the time it is scheduled by the workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this configuration basically specifies the maximum duration the Cadence server would wait for a heartbeat before assuming the activity worker has failed.
How to select a proper timeout value
Picking the StartToCloseTimeout is fairly straightforward when you know what it does. Essentially, you should make it long enough that the activity can complete under normal circumstances. Therefore, you should account for everything that can affect the time taken by an activity worker: the latency of your downstream dependencies (i.e. services, networking, etc.). On the other hand, you should aim to keep this value as small as feasible to make your end-to-end system more responsive. If you can't make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider using a HeartbeatTimeout config and implement heartbeating in your activity.
ScheduleToStartTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here. Therefore, it's worth taking a moment to pay some extra attention to this configuration.
Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:
Reduced worker pool throughput due to deployments, maintenance or network-related issues.
Down-stream latency spikes that would increase the time it takes to complete each activity task, which then reduces the throughput of the worker pool.
A significant spike in the number of workflow instances that schedule the activity; especially if one of the upstream services is also an asynchronous queue/stream processor which can create its own backlog and suddenly start processing it at a very high volume.
Ideally, no activity should time out while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried, because the retries would add more activity tasks to the queue and subsequently make it harder to recover from the backlog, or even make it worse. On the other hand, there are many use cases where business requirements really limit the total time the system can take to process an activity. Therefore, it's usually not a bad idea to aim for a high ScheduleToStartTimeout value as long as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes, or it might be perfectly fine to keep it there for several days before timing out.

Polling for external state transitions in Cadence workflows

I have a Cadence workflow where I need to poll an external AWS API until a particular resource transitions, which might take some amount of time. I assume I should make each individual 'checkStatus' request an Activity, and have the workflow perform the sleep/check loop. However, that means that I may have an unbounded number of activity calls in my workflow history. Is that worrisome? Is there a better way to accomplish this?
It depends on how frequently you want to poll.
For infrequent polls (every minute or slower), use server-side retry. Specify a RetryPolicy (or RetryOptions in Java) when invoking the activity. In the RetryPolicy, set the backoff coefficient to 1 and the initial interval to the poll frequency. Then fail the activity when the polled resource is not ready, and the server will retry it up to the specified retry policy expiration interval.
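A hedged Go sketch of this approach (CheckStatusActivity and the one-minute poll frequency are placeholders; imports from go.uber.org/cadence and go.uber.org/cadence/workflow assumed): the activity simply fails while the resource is not ready, and the server-side retry re-runs it on the poll interval.

func pollWithServerSideRetry(ctx workflow.Context) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Second, // one status check is quick
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Minute,    // the poll frequency
            BackoffCoefficient: 1.0,            // keep the interval constant
            ExpirationInterval: 24 * time.Hour, // give up after a day of polling
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)
    // CheckStatusActivity returns an error (e.g. "not ready yet") until the
    // resource has transitioned, which triggers the next server-side retry.
    return workflow.ExecuteActivity(ctx, CheckStatusActivity).Get(ctx, nil)
}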
For very frequent polls (every few seconds or faster), the solution is to implement the polling inside the activity as a loop that polls and then sleeps for the poll interval. To ensure that the polling activity is restarted in a timely manner after a worker failure/restart, the activity has to heartbeat on every iteration. Use an appropriate RetryPolicy for the restarts of such a failed activity.
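And a hedged sketch of the frequent-poll variant, done inside the activity itself (checkExternalResource is a placeholder for the real status call; imports from context, time and go.uber.org/cadence/activity assumed):

func PollUntilReadyActivity(ctx context.Context, pollInterval time.Duration) error {
    for {
        ready, err := checkExternalResource(ctx) // placeholder for the real API call
        if err != nil {
            return err
        }
        if ready {
            return nil
        }
        // Heartbeat on every iteration so a worker failure is detected quickly and
        // the activity is retried; set HeartbeatTimeout to a small multiple of pollInterval.
        activity.RecordHeartbeat(ctx)
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(pollInterval):
        }
    }
}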
In the rare case when the polling requires periodic execution of a sequence of activities, or the activity arguments should change between retries, a child workflow can be used. The trick is that a parent is not aware of a child calling continue-as-new; it only gets notified when the child completes or fails. So if the child executes the sequence of activities in a loop and calls continue-as-new periodically, the parent is not affected until the child completes.
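A hedged sketch of the child-workflow variant (PollingChild and runActivitySequence are placeholders; imports from time and go.uber.org/cadence/workflow assumed); the parent simply executes this as a child workflow and is only notified when it finally completes or fails:

func PollingChild(ctx workflow.Context, pollInterval time.Duration) error {
    done, err := runActivitySequence(ctx) // placeholder: runs the sequence of activities once
    if err != nil {
        return err
    }
    if done {
        return nil // only now does the parent get notified
    }
    if err := workflow.Sleep(ctx, pollInterval); err != nil {
        return err
    }
    // Start a fresh run with the same arguments; the parent is unaware of this.
    return workflow.NewContinueAsNewError(ctx, PollingChild, pollInterval)
}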

When configuring RetryOnFailure what is the maxRetryDelay parameter

I'm using Entity Framework Core 2.2 and I decided to follow a blog suggestion and enable retry on failure:
services.AddDbContext<MyDbContext>(options =>
    options.UseSqlServer(
        Configurations["ConnectionString"],
        sqlServerOptionsAction: sqlOptions =>
        {
            sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 10,
                maxRetryDelay: TimeSpan.FromSeconds(5),
                errorNumbersToAdd: null);
        }));
My question is what is the maxRetryDelay argument for?
I would expect it to be the delay time between retries, but the name implies it's the maximum time. Does that mean I can do my 10 retries 1 second apart and not 5 seconds apart as I desire?
The delay between retries is randomized up to the value specified by maxRetryDelay.
This is done to avoid multiple retries occurring at the same time and overwhelming a server. Imagine, for example, 10K requests to a web service failing due to a network issue and retrying at the same time after 15 seconds. The database server would get a sudden wave of 10K queries.
By randomizing the delay, retries are spread across time and client.
The delay for each retry is calculated by ExecutionStrategy.GetNextDelay. The source shows it's a random exponential backoff.
The default SqlServerRetryingExecutionStrategy uses that implementation. A custom retry strategy could use a different implementation.

How to timeout idle user sessions in Redshift?

I am looking for a way to terminate user sessions that have been inactive or open for an arbitrary amount of time in Redshift. I noticed that in STV_SESSIONS I have a large number of sessions open, often for the same user, sometimes having been initialized days earlier. While I understand that this might be a symptom of a larger issue with the way some things close out of Redshift, I was hoping for a configurable timeout solution.
In the AWS documentation I found PG_TERMINATE_BACKEND (http://docs.aws.amazon.com/redshift/latest/dg/PG_TERMINATE_BACKEND.html), but I was hoping for a more automatic solution.
The WLM timeout is only for timing out queries and not for a session:
Timeout (ms)
The maximum time, in milliseconds, queries can run before being canceled. If a read-only query, such as a SELECT statement, is canceled due to a WLM timeout, WLM attempts to route the query to the next matching queue based on the WLM Queue Assignment Rules. If the query doesn't match any other queue definition, the query is canceled; it is not assigned to the default queue. For more information, see WLM Query Queue Hopping. WLM timeout doesn’t apply to a query that has reached the returning state. To view the state of a query, see the STV_WLM_QUERY_STATE system table.
JSON property: max_execution_time
You can use the Workload Management Configuration in AWS Redshift, where you can set the user group, query group, and session timeout. You can group the relevant users together, assign a group name to them, and set the session timeout for them. This is how I do it: set up the query queue based on your priority, then set the concurrency level for the user group and the timeout in ms.
For more information, you can refer to AWS documentation.
Source - Workload Management
- Configuring Workload Management
It's pretty easy and straightforward.
If I’ve made a bad assumption please comment and I’ll refocus my answer.
You can use the newly introduced idle session timeout feature in Redshift. It is available both when creating a user and after creation (using an ALTER statement). Look up the SESSION TIMEOUT parameter.