Retry task without incrementing retry counter - celery

Is it possible to retry tasks without incrementing the retry counter?
My task calls an HTTP backend, and if the backend is down for some reason, I do not want to lose my tasks.

Set max_retries = None and it will never stop retrying. Pass it as a keyword to the task decorator or to the retry call itself.
Documentation


How to properly handle race condition caused by retry worker

In one of our services we have some connection issues and are getting random timeouts (we think it is caused by the client library; the service is one of our caching services). We decided to handle it by putting the failed operation in a queue and retrying it on a separate worker until we solve the underlying issue.
However, there is a problematic case. Say we want to put the value "A" into the cache, but the write fails, so we put it in the queue to retry later. During this time the user fires a delete request for that data, and the delete completes without any timeout (no error, but also no record to delete). Then our retry strategy writes that data to the cache, even though it is supposed to be deleted and should not be there.
How would we handle this scenario? My first thought was to raise an error if the delete does not delete anything, but that has many complications of its own and can even end in endless retries.
It appears the issue arises because you perform the actual action on the main thread and only push it to the queue for the worker thread when it fails.
If you route the actual action through the queue and worker thread as well, the issue is resolved, because every operation on a key is then applied in order (a sketch of this idea follows this answer).
A second solution is to track all the keys that are queued for retry. If a new action arrives for a key that is already in the queue, queue that action as well. For example, the delete should be queued as the action for "A" because a retry action for "A" is already queued.
The second solution is a little less efficient.
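A minimal Java sketch of the first suggestion, i.e. pushing every cache operation for a key through the same worker queue so order is preserved (the class and method names are made up for illustration):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Serialize all cache operations for the same key, so a retried PUT can never
    // run after a later DELETE for that key.
    class PerKeyCacheQueue {
        private final Map<String, ExecutorService> lanes = new ConcurrentHashMap<>();

        // Each key gets its own single-threaded lane, so its operations run strictly
        // in submission order; a failing operation retries in place before the next
        // operation for that key starts.
        void submit(String key, Runnable operation) {
            lanes.computeIfAbsent(key, k -> Executors.newSingleThreadExecutor())
                 .execute(() -> runWithRetry(operation));
        }

        private void runWithRetry(Runnable operation) {
            while (true) {
                try {
                    operation.run();
                    return;
                } catch (RuntimeException e) {
                    // Transient failure (e.g. cache client timeout): back off, then retry.
                    try {
                        Thread.sleep(1_000);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }
    }

Usage would be something like queue.submit("A", () -> cache.put("A", value)) followed later by queue.submit("A", () -> cache.delete("A")); the delete can only run after the put has finally succeeded or been given up on.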

How to set proper timeout values for Cadence activities (local and regular activities, with or without retry)?

So there are so many timeout values:
For local activity:
ScheduleToClose timeout
For regular activity without retry:
ScheduleToStart timeout
ScheduleToClose timeout
StartToClose timeout
Heartbeat timeout
And then more values in retryOptions:
ExpirationInterval
InitialInterval
BackoffCoefficient
MaximumInterval
MaximumAttempts
And retryOptions can be applied to a localActivity or a regular activity.
How do I use them together, and with what expectations?
TL;DR
The easiest way of using timeouts:
Regular Activity with retry (a Java sketch of this setup follows the list):
Use StartToClose as the timeout of each attempt
Leave ScheduleToStart and ScheduleToClose empty
If StartToClose is large (like 10 minutes), set the Heartbeat timeout to a smaller value like 10s, and call the heartbeat API inside the activity regularly.
Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control the backoff.
Use retryOptions.ExpirationInterval as the overall timeout of all attempts.
Leave retryOptions.MaximumAttempts empty.
Regular Activity without retry:
Use ScheduleToClose for the overall timeout
Leave ScheduleToStart and StartToClose empty
If ScheduleToClose is large (like 10 minutes), set the Heartbeat timeout to a smaller value like 10s, and call the heartbeat API inside the activity regularly.
LocalActivity without retry: Use ScheduleToClose for the overall timeout
LocalActivity with retry:
Use ScheduleToClose as the timeout of each attempt.
Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control the backoff.
Use retryOptions.ExpirationInterval as the overall timeout of all attempts.
Leave retryOptions.MaximumAttempts empty.
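For illustration, the "Regular Activity with retry" setup above might look roughly like this with the Cadence Java client (a sketch assuming cadence-java-client 2.x; the builder and method names are from memory and may differ slightly between versions, and MyActivities/callBackend are hypothetical):

    import com.uber.cadence.activity.ActivityOptions;
    import com.uber.cadence.common.RetryOptions;
    import com.uber.cadence.workflow.Workflow;
    import java.time.Duration;

    // Inside a workflow implementation class:
    // StartToClose bounds each attempt, Heartbeat detects stuck attempts,
    // ExpirationInterval bounds all attempts together, MaximumAttempts is left unset.
    private final MyActivities activities = Workflow.newActivityStub(
            MyActivities.class,   // hypothetical activity interface
            new ActivityOptions.Builder()
                    .setStartToCloseTimeout(Duration.ofMinutes(10)) // timeout of each attempt
                    .setHeartbeatTimeout(Duration.ofSeconds(10))    // activity must heartbeat regularly
                    .setRetryOptions(new RetryOptions.Builder()
                            .setInitialInterval(Duration.ofSeconds(1))
                            .setBackoffCoefficient(2.0)
                            .setMaximumInterval(Duration.ofMinutes(1))
                            .setExpiration(Duration.ofHours(1))     // overall timeout of all attempts
                            .build())
                    .build());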
More TL;DR
Because activities should be idempotent, every activity should set a retry policy.
Temporal sets an infinite retry policy for every activity by default. Cadence should do the same, in my opinion.
iWF also sets a default infinite retry for State APIs, to match Temporal activities.
What and Why
Basics without Retry
Things are easier to understand in the world without retry, because that is where Cadence started.
ScheduleToClose timeout is the overall end-to-end timeout from a workflow's perspective.
ScheduleToStart timeout is the time allowed for an activity worker to start the activity. If it is exceeded, the activity returns a ScheduleToStart timeout error/exception to the workflow.
StartToClose timeout is the time allowed for the activity to run once started. If it is exceeded, a StartToClose timeout error is returned to the workflow.
Requirements and defaults:
Either ScheduleToClose must be provided, or both ScheduleToStart and StartToClose must be provided.
If only ScheduleToClose is provided, then ScheduleToStart and StartToClose default to it.
If only ScheduleToStart and StartToClose are provided, then ScheduleToClose = ScheduleToStart + StartToClose.
All of them are capped by the workflow timeout (e.g. if workflowTimeout is 1 hour, setting ScheduleToClose to 2 hours still gives 1 hour: ScheduleToClose = Min(ScheduleToClose, workflowTimeout)).
So why do all of these exist?
You may notice that ScheduleToClose is only useful when ScheduleToClose < ScheduleToStart + StartToClose: if ScheduleToClose >= ScheduleToStart + StartToClose, the ScheduleToClose timeout is already enforced by the combination of the other two and becomes meaningless.
So the main use case for ScheduleToClose being less than the sum of the other two is when people want to limit the overall timeout of the activity but allow more time for ScheduleToStart or StartToClose. This is an extremely rare use case.
Similarly, the main use case for distinguishing ScheduleToStart and StartToClose is when the workflow needs to do special handling for a ScheduleToStart timeout error. This is also a very rare use case.
That is why the TL;DR recommends using only ScheduleToClose and leaving the other two empty: only rare cases need them. If you can't think of the use case, you don't need it.
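In code, the common case then only needs one timeout (reusing the imports from the earlier sketch; as before, the names are from memory and may vary by client version):

    // Only the end-to-end timeout is set; ScheduleToStart and StartToClose
    // default from it as described above.
    ActivityOptions simpleOptions = new ActivityOptions.Builder()
            .setScheduleToCloseTimeout(Duration.ofMinutes(5))
            .build();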
LocalActivity doesn't have ScheduleToStart/StartToClose because it is started directly inside the workflow worker, without server-side scheduling involved.
Heartbeat timeout
Heartbeating is very important for long-running activities, to prevent them from getting stuck. Not only bugs can cause an activity to get stuck; regular deployments, host restarts, and failures can as well, because without heartbeats the Cadence server cannot know whether the activity is still being worked on. For more details see Solutions to fix stuck timers / activities in Cadence/SWF/StepFunctions.
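Inside the activity implementation, heartbeating is just a periodic call to the heartbeat API. A rough sketch with the Cadence Java client (Activity.heartbeat is the client call as I recall it; processItem is a hypothetical helper):

    import com.uber.cadence.activity.Activity;

    // Long-running activity method: report progress regularly so the server can
    // detect a stuck or dead worker via the Heartbeat timeout.
    public void processAll(String[] items) {
        for (int i = 0; i < items.length; i++) {
            processItem(items[i]);   // hypothetical per-item work
            Activity.heartbeat(i);   // record progress as heartbeat details
        }
    }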
RetryOptions and Activity with Retry
First of all, RetryOptions here means server-side backoff retry -- the retries are managed automatically by Cadence without interacting with the workflow. Because the retry is managed by Cadence, the activity has to be handled specially in Cadence history: the started event cannot be written until the activity is closed. For reference: Why an activity task is scheduled but not started?
Alternatively, a workflow can do client-side retry on its own, meaning the workflow manages the retry logic. You can write your own retry function, or use a helper in the SDK, like Workflow.retry in Cadence-java-client. Client-side retry shows all start events immediately, but retrying a single activity produces many events in the history, so it is not recommended for performance reasons.
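For illustration, a hand-rolled client-side retry inside workflow code could look like the sketch below (a fragment inside a workflow method, reusing the imports and the hypothetical activities stub from the earlier sketch; Workflow.sleep is the Java client's durable sleep). The SDK helper Workflow.retry wraps similar logic for you.

    // Client-side retry: the workflow itself loops, so every attempt is recorded
    // as its own activity (plus a timer) in the workflow history.
    Duration backoff = Duration.ofSeconds(1);
    for (int attempt = 1; attempt <= 5; attempt++) {
        try {
            activities.callBackend();   // hypothetical activity call
            break;
        } catch (RuntimeException e) {  // in real code, catch the client's activity failure exception
            Workflow.sleep(backoff);    // durable timer, also recorded in history
            backoff = backoff.multipliedBy(2);
        }
    }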
So what do the options mean:
ExpirationInterval:
It replaces the ScheduleToClose timeout to become the actual overall timeout of the activity across all attempts.
It is also capped by the workflow timeout like the other three timeout options: ScheduleToClose = Min(ScheduleToClose, workflowTimeout).
The timeout of each attempt is StartToClose, and StartToClose defaults to ScheduleToClose as explained above.
ScheduleToClose will be extended to ExpirationInterval: ScheduleToClose = Max(ScheduleToClose, ExpirationInterval), and this happens before ScheduleToClose is copied to ScheduleToStart and StartToClose.
InitialInterval: the interval before the first retry.
BackoffCoefficient: the multiplier applied to the retry interval after each attempt.
MaximumInterval: the cap on the retry interval (e.g. InitialInterval=1s, BackoffCoefficient=2 and MaximumInterval=60s retry after 1s, 2s, 4s, ..., capped at 60s).
MaximumAttempts: the maximum number of attempts. If set together with ExpirationInterval, retry stops when either of them is exceeded.
Requirements and defaults:
Either MaximumAttempts or ExpirationInterval is required; ExpirationInterval is set to the workflow timeout if not provided.
ExpirationInterval is therefore always present, and in practice it is more useful. MaximumAttempts is usually harder to use because it is easily messed up by BackoffCoefficient (e.g. with BackoffCoefficient > 1, the end-to-end time can become very large if you are not careful). So I recommend just using ExpirationInterval, unless you really need MaximumAttempts.

Polling for external state transitions in Cadence workflows

I have a Cadence workflow where I need to poll an external AWS API until a particular resource transitions, which might take some amount of time. I assume I should make each individual 'checkStatus' request an Activity, and have the workflow perform the sleep/check loop. However, that means that I may have an unbounded number of activity calls in my workflow history. Is that worrisome? Is there a better way to accomplish this?
It depends on how frequently you want to poll.
For infrequent polls (every minute or slower), use server-side retry. Specify a RetryPolicy (RetryOptions in Java) when invoking the activity. In the RetryPolicy, set the backoff coefficient to 1 and the initial interval to the poll frequency. Then fail the activity whenever the polled resource is not ready, and the server will retry it up to the retry policy's expiration interval.
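A rough sketch of that approach with the Cadence Java client (names from memory, as in the earlier sketches; MyPollActivities.pollStatus is a hypothetical activity that throws while the resource is not ready):

    import com.uber.cadence.activity.ActivityOptions;
    import com.uber.cadence.common.RetryOptions;
    import com.uber.cadence.workflow.Workflow;
    import java.time.Duration;

    // Poll roughly every 60s by letting the server retry the failing activity.
    ActivityOptions pollOptions = new ActivityOptions.Builder()
            .setStartToCloseTimeout(Duration.ofSeconds(10))
            .setRetryOptions(new RetryOptions.Builder()
                    .setInitialInterval(Duration.ofSeconds(60)) // poll frequency
                    .setBackoffCoefficient(1.0)                 // keep the interval constant
                    .setExpiration(Duration.ofHours(2))         // stop polling after 2 hours
                    .build())
            .build();

    MyPollActivities poller = Workflow.newActivityStub(MyPollActivities.class, pollOptions);
    poller.pollStatus(); // throws while the resource is not ready, so the server keeps retrying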
For very frequent polls (every few seconds or faster), implement the polling inside the activity as a loop that polls and then sleeps for the poll interval. To ensure the polling activity is restarted in a timely manner after a worker failure/restart, the activity has to heartbeat on every iteration. Use an appropriate RetryPolicy for restarting such a failed activity.
In the rare case where polling requires executing a sequence of activities, or the activity arguments need to change between retries, a child workflow can be used. The trick is that a parent is not aware of a child calling continue-as-new; it is only notified when the child completes or fails. So if the child executes the sequence of activities in a loop and calls continue-as-new periodically, the parent is not affected until the child completes.

How to set Akka actors to run only for a specific time period?

I have a big task, which I break down into smaller tasks and analyse. I have a basic model:
Master, worker and listener.
The master creates the tasks and gives them to worker actors. Once a worker actor completes a task, it asks the master for another one. Once all tasks are completed, they inform the listener. They usually take less than 2 minutes to complete 1000 tasks.
Now, some tasks might take more time than others. I want to set a timer for each task, and if a task takes too long, the worker's task should be aborted by the master and resubmitted later as a new one. How do I implement this? I can calculate the time taken by a worker task, but how does the master actor keep tabs on the time taken by all worker actors in real time?
One way of handling this would be for each worker, on receipt of a task to work on, to set a receive timeout before changing state to process the task, e.g.:
context.setReceiveTimeout(5 minutes) // for the '5 minutes' notation - import scala.concurrent.duration._
If the ReceiveTimeout message is received, the worker can abort the task (or take whatever other action you deem appropriate, e.g. kill itself, or send a notification message back to the master). Don't forget to cancel the timeout (set the duration to Duration.Undefined) when the task is completed, or the like.
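The same idea with Akka's classic Java API might look like the sketch below (assuming a recent Akka version where setReceiveTimeout/cancelReceiveTimeout take java.time.Duration; Task, TaskDone and the master notification are illustrative placeholders):

    import akka.actor.AbstractActor;
    import akka.actor.ReceiveTimeout;
    import java.time.Duration;

    // Worker that aborts a task when it runs too long, using a receive timeout.
    public class Worker extends AbstractActor {
        // Hypothetical message types, for illustration only.
        public static final class Task { }
        public static final class TaskDone { }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(Task.class, task -> {
                        getContext().setReceiveTimeout(Duration.ofMinutes(5)); // arm the timer
                        // ... start processing the task ...
                    })
                    .match(TaskDone.class, done -> {
                        getContext().cancelReceiveTimeout(); // finished in time, disarm the timer
                        // ... report the result to the master/listener ...
                    })
                    .match(ReceiveTimeout.class, timeout -> {
                        getContext().cancelReceiveTimeout();
                        // abort the task and notify the master so it can resubmit it
                    })
                    .build();
        }
    }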

zookeeper queue delay?

What would you guys suggest to be a good way to implement a queue in zookeeper that has the ability to delay a job without blocking a worker?
For reference, see beanstalkd's delayed job option.
What you need is to build a barrier using ZooKeeper.
I assume the "delay time" is set by another process, called the master.
The master first creates a node, say /work/flag, with data "false".
What the worker needs to do is get and watch the node /work/flag. The watcher calls back asynchronously, so the worker can do other things and is not blocked.
When the time comes, the master sets the /work/flag data to "true", which causes a ZOO_CHANGED_EVENT event.
The worker then gets the event callback for ZOO_CHANGED_EVENT on /work/flag, reads the node, checks whether the value is "true", and decides whether to continue the workflow.
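A minimal sketch of the worker side with the ZooKeeper Java client (in the Java API the change arrives as EventType.NodeDataChanged rather than the C client's ZOO_CHANGED_EVENT; the connection string and flag path are illustrative):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class FlagWorker {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

            // The watcher is invoked asynchronously, so the worker is never blocked.
            Watcher flagWatcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                        try {
                            // Re-read the flag and re-register the watch.
                            byte[] data = zk.getData("/work/flag", this, null);
                            if ("true".equals(new String(data))) {
                                // The master flipped the flag: run the delayed job here.
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            };

            // The initial read registers the watch; the worker is free to do other work.
            zk.getData("/work/flag", flagWatcher, null);

            Thread.sleep(Long.MAX_VALUE); // keep the process alive for the demo
        }
    }

On the master side, releasing the workers when the delay elapses is a single call: zk.setData("/work/flag", "true".getBytes(), -1).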