For an activity called in a loop, does the retry policy for the activity apply to each run? - cadence-workflow

Given a workflow with an activity A whose maximum retries is set to 3, suppose I have the following piece of code:
for (String type : types) {
    activityA.process(type);
}
where types in this case is ["type1", "type2", "type3"].
If activityA processes type1 successfully, then starts processing type2 and fails for some reason:
1. Will the retry policy for activityA apply each time a type is run, or will it be 3 retries across all activity types?
2. If the workflow fails while executing type2, will the workflow restart from the beginning and process type1 again, or will it start from type2?

For 1: The retry policy works independently for each activity invocation, so each type gets its own three retries.
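For reference, here is a minimal sketch of how such a per-activity retry policy is typically configured with the Cadence Java client; the timeout and interval values are illustrative assumptions, and ActivityA stands in for your activity interface:

import java.time.Duration;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.common.RetryOptions;
import com.uber.cadence.workflow.Workflow;

// Inside the workflow implementation: build an activity stub whose retry
// policy applies independently to each invocation made through it.
ActivityOptions options = new ActivityOptions.Builder()
        .setScheduleToCloseTimeout(Duration.ofMinutes(5))   // illustrative timeout
        .setRetryOptions(new RetryOptions.Builder()
                .setInitialInterval(Duration.ofSeconds(1))  // illustrative backoff start
                .setMaximumAttempts(3)  // 3 attempts per call, not shared across types
                .build())
        .build();
ActivityA activityA = Workflow.newActivityStub(ActivityA.class, options);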
For 2: Workflow failure is a terminal state for a workflow execution. It will not be retried automatically unless you specify a retry policy when starting the workflow. When the workflow retries, it starts from the very beginning.
See also https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries
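For completeness, a sketch of opting into workflow retries with the Java client by supplying a retry policy at start time; MyWorkflow, the task list name, and the timeout values are illustrative assumptions:

import java.time.Duration;
import com.uber.cadence.client.WorkflowClient;
import com.uber.cadence.client.WorkflowOptions;
import com.uber.cadence.common.RetryOptions;

// When starting the workflow: without RetryOptions a failed workflow stays
// failed; with them, the whole workflow is retried from the very beginning.
WorkflowOptions workflowOptions = new WorkflowOptions.Builder()
        .setTaskList("my-task-list")  // illustrative task list name
        .setExecutionStartToCloseTimeout(Duration.ofHours(1))
        .setRetryOptions(new RetryOptions.Builder()
                .setInitialInterval(Duration.ofSeconds(10))
                .setMaximumAttempts(3)
                .build())
        .build();
// workflowClient is an already-constructed com.uber.cadence.client.WorkflowClient.
MyWorkflow stub = workflowClient.newWorkflowStub(MyWorkflow.class, workflowOptions);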
Or perhaps what you are asking about is a worker failure rather than a workflow failure? Cadence is very fault tolerant to worker failures: the workflow automatically resumes from where it left off before the previous worker died.
See also https://cadenceworkflow.io/docs/concepts/workflows/#state-recovery-and-determinism

Related

ADF Scheduling when existing Job not yet finished

Having read https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-scheduling-and-execution, it is unclear to me:
if a schedule is set for a job to run every hour,
can we stop the concurrent execution of the next job at hr+1 if the job for hr+0 is still running?
It looks as if concurrency = 1 means this,
but does that invocation simply not start until the current execution is finished?
Or will it be discarded?
When we set the concurrency to 1, only one instance is allowed to run at a time. If the scheduled trigger fires again and tries to run the pipeline while it is already running, the next invocation is queued and will start after the current instance finishes.
So for your question: the following invocation is queued, not discarded. After the first run finishes, the next run will start.

Is there a way to configure retries for Azure DevOps pipeline tasks or jobs?

Currently I have a OneBranch DevOps pipeline that fails every now and then while restoring packages. Usually it fails because of some transient error like a socket exception or timeout. Re-trying the job usually fixes the issue.
Is there a way to configure a job or task to retry?
Azure DevOps now supports the retryCountOnTaskFailure setting on a task to do just this.
See this page for further information:
https://learn.microsoft.com/en-us/azure/devops/release-notes/2021/pipelines/sprint-195-update
Update:
Automatic retries for a task have been added and should be available for use by the time you read this.
It can be used as follows:
- task: <name of task>
  retryCountOnTaskFailure: <max number of retries>
  ...
Here are a few things to note when using retries:
The failing task is retried immediately.
There is no assumption about the idempotency of the task. If the task has side-effects (for instance, if it created an external resource partially), then it may fail the second time it is run.
There is no information about the retry count made available to the task.
A warning is added to the task logs indicating that it has failed before it is retried.
All of the attempts to retry a task are shown in the UI as part of the same task node.
Original answer:
There is no way of doing that with native tasks. However, if you can script the step, then you can put such retry logic inside it.
You could do this, for instance, in the following way:
n=0
until [ "$n" -ge 5 ]     # try up to 5 times
do
    command && break     # substitute your command here; break out on success
    n=$((n+1))
    sleep 15             # wait 15 seconds between attempts
done
However, there is currently no native way of doing this for regular tasks. Automatic task retry is on the roadmap, so this could change in the near future.

DataStage: How to keep a continuous mode job running after an unexpected termination

I have a job that uses the Kafka Connector Stage to read a Kafka queue and then load into the database. That job runs in Continuous Mode and never concludes, since it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues, etc.) the job may terminate with a failure; in general, that happens after about 300 running hours. So, to keep the job alive, I have to manually check the job status and then do a Reset and Run.
The problem is that several hours can pass between the job termination and my manual Reset and Run, which is critical. So I'm looking for a way to eliminate the manual interaction and reduce that gap by automating the job invocation.
I tried to use Control-M to run the job daily, but with no success: the first day, Control-M called the job and it ran fine. But the next day, when Control-M attempted to instantiate the job again, it failed (since the job was already running). Besides, DataStage will never tell Control-M that the job concluded successfully, since the job's nature won't allow that.
That said, I would like to hear any ideas that could point me in the right direction.
The first thing that came to mind is to create an intermediate Sequence and schedule it in Control-M; this new Sequence would then call the continuous job asynchronously by using a command line stage.
For the case where just this one job terminates unexpectedly and you want it restarted as soon as possible, have you considered calling this job from a sequence? The sequence could be set up to loop running this job.
Thus the sequence starts the job and waits for it to finish. When the job finishes, the sequence loops and starts the job again. You could add conditions on job exit (for example, if the job aborted, then based on that end status you could reset the job before re-running it).
This would not handle the condition where the DataStage engine itself is shut down (such as for maintenance or possibly an error), in which case all jobs end, including your new sequence. The same also applies for a server reboot or other situations where someone may have inadvertently stopped your sequence. For those cases (such as a DataStage engine stop), your team would need to have a process in place for jobs/sequences that need to be started up following a DataStage or system outage.
For the outage scenario, you could create a monitor script (regardless of whether you run the job solo or from a sequence) that sleeps/loops on 5-10 minute intervals, checks the status of your job using the dsjob command, and if it is not running starts that job/sequence (also via the dsjob command). You can decide whether that monitor starts at DataStage startup, at machine startup, or runs from Control-M or another scheduler. A sketch of this idea follows.
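A minimal sketch of that monitor idea, written here in Java around the dsjob command line rather than as a shell script; the project and job names are hypothetical, and parsing the -jobinfo output for RUNNING is an assumption you would adjust to your dsjob version:

import java.util.concurrent.TimeUnit;

public class ContinuousJobMonitor {
    public static void main(String[] args) throws Exception {
        String project = "MyProject";  // hypothetical project name
        String job = "KafkaLoadJob";   // hypothetical job name
        while (true) {
            // Ask the engine for the current job status.
            Process info = new ProcessBuilder("dsjob", "-jobinfo", project, job)
                    .redirectErrorStream(true).start();
            String output = new String(info.getInputStream().readAllBytes());
            info.waitFor();
            if (!output.contains("RUNNING")) {
                // Reset the aborted job, then start it again.
                new ProcessBuilder("dsjob", "-run", "-mode", "RESET", project, job)
                        .inheritIO().start().waitFor();
                new ProcessBuilder("dsjob", "-run", project, job)
                        .inheritIO().start().waitFor();
            }
            TimeUnit.MINUTES.sleep(5);  // poll on a 5 minute interval
        }
    }
}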

DecisionTaskTimedOut before the specified timeout

I have a case where a decision times out after 5 seconds even though the timeout is set to 10:
17 2019-06-13T17:46:59Z DecisionTaskScheduled {TaskList:{Name:maxim-C02XD0AAJGH6:db09fd84-98bf-4546-a0d8-fb51e30c2b41},
StartToCloseTimeoutSeconds:10, Attempt:0}
18 2019-06-13T17:47:04Z DecisionTaskTimedOut {ScheduledEventId:17,
StartedEventId:0,
TimeoutType:SCHEDULE_TO_START}
This is using the Cadence service running in local Docker, and I can reproduce it reliably.
The 5s timeout is due to the Cadence Sticky Execution feature. Sticky Execution is enabled by default on the Cadence worker, which allows the workflow state to be cached on the worker after it responds with decisions. This lets the Cadence server dispatch new decision tasks directly to the same worker, which can then reuse the cached state and produce new decisions without replaying the entire execution history.
The decision SCHEDULE_TO_START timeout is in place so that the decision can be sent to another worker when the original worker restarts and there is no longer a poller on the sticky tasklist for the workflow execution. When it fires, the Cadence server clears the stickiness for that execution and dispatches the decision to the original tasklist, so it can be picked up by any other worker. The 5s default comes from the following worker option:
// Optional: Sticky schedule to start timeout.
// default: 5s
// The resolution is seconds. See details about StickyExecution on the comments for DisableStickyExecution.
StickyScheduleToStartTimeout time.Duration

Spring batch jobOperator - how are multiple concurrent instances of a job from the same XML file controlled?

When we run multiple concurrent jobs with different parameters, how can we control (stop, restart) the appropriate jobs? Our internal code provides the JobExecution object, but under the covers the JobOperator uses the job name to get the job instance.
In our case all of the jobs come from "do-stuff.xml" (okay, it's sanitized and not very original). After looking at the spring-batch source code, our concern is that if there is more than one job running and we stop a job, it will take the most recently submitted job and stop that one.
The JobOperator allows you to fetch all running executions of a job using getRunningExecutions(String jobName). You should be able to iterate over that set to find the one you want, then call stop(long executionId) on it, as in the sketch below.
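A minimal sketch of that approach; the job name and parameter key are hypothetical stand-ins for whatever "do-stuff.xml" defines:

import java.util.Set;
import org.springframework.batch.core.launch.JobOperator;

public class JobStopper {
    private final JobOperator jobOperator;

    public JobStopper(JobOperator jobOperator) {
        this.jobOperator = jobOperator;
    }

    // Stop only the running execution whose parameters match, instead of
    // whichever execution happens to be the most recently submitted one.
    public void stopExecutionFor(String type) throws Exception {
        Set<Long> runningIds = jobOperator.getRunningExecutions("doStuffJob");  // hypothetical job name
        for (Long executionId : runningIds) {
            String params = jobOperator.getParameters(executionId);  // e.g. "type=type2"
            if (params.contains("type=" + type)) {
                jobOperator.stop(executionId);
            }
        }
    }
}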
Alternatively, we've also implemented listeners (both at the step and chunk level) to check an outage status table. When we want to implement a system-wide outage, we add the outage there and have our listener throw an exception to bring our jobs down. Once the outage is lifted, all "failed" executions may be restarted.
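For illustration, a sketch of the chunk-level variant of such a listener (this assumes a recent Spring Batch where ChunkListener methods take a ChunkContext; OutageStatusDao is a hypothetical accessor for the outage status table):

import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;

public class OutageCheckingChunkListener implements ChunkListener {

    // Hypothetical accessor for the outage status table.
    public interface OutageStatusDao {
        boolean isOutageActive();
    }

    private final OutageStatusDao outageStatusDao;

    public OutageCheckingChunkListener(OutageStatusDao outageStatusDao) {
        this.outageStatusDao = outageStatusDao;
    }

    @Override
    public void beforeChunk(ChunkContext context) {
        // Fail fast during a system-wide outage; the resulting "failed"
        // execution can be restarted once the outage is lifted.
        if (outageStatusDao.isOutageActive()) {
            throw new IllegalStateException("System-wide outage in effect");
        }
    }

    @Override
    public void afterChunk(ChunkContext context) { }

    @Override
    public void afterChunkError(ChunkContext context) { }
}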