Is there a way to manually retry a step in Argo DAG workflow? - argo-workflows

The Argo UI shows a "Retry" button for DAG workflows, but when a step fails and I use it, the retry itself always fails. Is manual retry even supported in Argo?

Related

Github Actions Concurrency Queue

Currently we are using GitHub Actions for our infrastructure CI.
The infrastructure uses Terraform, and a code change in a module triggers plan and deploy for the changed module only (hence only the related modules are updated, e.g. one pod container).
Since an auto-update can be triggered by a push to another GitHub repository, updates can arrive in roughly the same time frame, e.g. Pod A's image is updated and Pod B's image is updated.
Without any concurrency in place, one of the actions will fail due to a lock timeout, since Terraform holds a state lock.
After implementing concurrency it is fine for just two near-simultaneous pushes, since the second one can wait for the first to finish.
Yet if more pushes come in, GitHub's concurrency only keeps the most recent push in the queue and cancels the waiting ones (the in-progress run can still continue). This is logical from a single-app perspective, but since our infra code relies on diff checks, skipping the deployment of a cancelled job actually means a deployment is bypassed!
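For reference, a concurrency setup like the one described looks roughly like this (the trigger, group name and steps are illustrative, not taken from the actual pipeline):

name: deploy-infra

on:
  push:
    branches: [main]

concurrency:
  # One shared group: at most one run in progress and one run pending;
  # any additional pending run is cancelled in favour of the newest push.
  group: terraform-deploy
  cancel-in-progress: false

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init && terraform plan -out=tfplan && terraform apply tfplan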
Is there a mechanism to queue workflows (or maybe even set a queue wait timeout) in GitHub Actions?
Eventually we wrote our own script in the workflow to wait for previous runs (a sketch is shown after the tutorial link below):
- get information on the current run,
- collect previous non-completed runs,
- wait until they are completed (in a loop),
- once out of the waiting loop, continue with the workflow.
Tutorial on checking status of workflow jobs
https://www.softwaretester.blog/detecting-github-workflow-job-run-status-changes
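A rough sketch of such a wait step, using the GitHub CLI that ships on hosted runners (the poll interval and the use of run IDs to identify earlier runs are assumptions):

- name: Wait for earlier runs of this workflow
  env:
    GH_TOKEN: ${{ github.token }}   # gh needs a token
  run: |
    while true; do
      # count in-progress runs of this workflow with a lower run ID than ours
      earlier=$(gh run list --workflow "${{ github.workflow }}" \
                  --status in_progress --json databaseId \
                  --jq "[.[] | select(.databaseId < ${{ github.run_id }})] | length")
      [ "$earlier" -eq 0 ] && break
      echo "Waiting for $earlier earlier run(s) to finish..."
      sleep 30
    done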

Notify completion of argo workflow

I have a use case where I am triggering an Argo workflow from a Python application. However, I need a mechanism for the Argo workflow to notify my Python application when the workflow execution is completed. I am already using a pub/sub mechanism in my Python application, so I want my Python app to subscribe to a Redis queue and take action once the workflow publishes a message on that queue announcing its completion.
This is the interaction flow I am looking for
Workflow --> Redis queue --> Python app
Thanks for help
You can use Argo Workflows' exit handler (onExit):
https://argoproj.github.io/argo-workflows/variables/#exit-handler
https://github.com/argoproj/argo-workflows/blob/master/examples/exit-handlers.yaml
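A minimal sketch of such an exit handler that publishes the final status to a Redis channel with redis-cli (the image, channel name and Redis service address are assumptions):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: notify-example-
spec:
  entrypoint: main
  onExit: notify-redis              # runs after the workflow finishes, success or failure
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo doing the real work"]
    - name: notify-redis
      container:
        image: redis:7              # provides redis-cli
        command: [sh, -c]
        args:
          - >-
            redis-cli -h redis.default.svc.cluster.local
            PUBLISH workflow-events
            '{"name": "{{workflow.name}}", "status": "{{workflow.status}}"}'

The Python app would then subscribe to the same channel and react to the message.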

For an activity called in a loop, does the retry policy for the activity apply to each run?

For a given workflow with activity A with max retries set to 3, if I have the following piece of code:
for (String type : types) {
    activityA.process(type);
}
and types in this case is ["type1", "type2", "type3"]
So if activityA processed type1 successfully, then starts processing type2 and fails for some reason:
1. Will the retry policy for activityA apply each time a type is run, or will it be 3 retries across all activity types?
2. If the workflow fails when executing type2, will the workflow restart from the beginning and process type1 again, or will it start from type2?
For 1: the retry policy applies independently to each activity invocation, so each type gets three retries.
For 2: workflow failure is a terminal state for the workflow execution. It will not retry automatically unless you specify a retry policy when starting the workflow. When the workflow retries, it starts from the very beginning.
See also https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries
Or maybe what you asked about is a worker failure rather than a workflow failure? Cadence is very fault tolerant to worker failures: the workflow will automatically resume running from where it left off before the previous worker died.
See also https://cadenceworkflow.io/docs/concepts/workflows/#state-recovery-and-determinism

Is there a way to configure retries for Azure DevOps pipeline tasks or jobs?

Currently I have a OneBranch DevOps pipeline that fails every now and then while restoring packages. Usually it fails because of some transient error like a socket exception or timeout. Re-trying the job usually fixes the issue.
Is there a way to configure a job or task to retry?
Azure DevOps now supports the retryCountOnTaskFailure setting on a task to do just this.
See this page for further information:
https://learn.microsoft.com/en-us/azure/devops/release-notes/2021/pipelines/sprint-195-update
Update:
Automatic retries for a task have been added, and by the time you read this they should be available for use.
It can be used as follows:
- task: <name of task>
  retryCountOnTaskFailure: <max number of retries>
  ...
Here are a few things to note when using retries:
The failing task is retried immediately.
There is no assumption about the idempotency of the task. If the task has side-effects (for instance, if it created an external resource partially), then it may fail the second time it is run.
There is no information about the retry count made available to the task.
A warning is added to the task logs indicating that it has failed before it is retried.
All of the attempts to retry a task are shown in the UI as part of the same task node.
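For the package-restore scenario in the question, a minimal sketch could look like this (the choice of NuGetCommand@2 and its inputs are assumptions about your setup):

steps:
  - task: NuGetCommand@2
    displayName: Restore packages
    retryCountOnTaskFailure: 3     # re-run the task up to 3 more times if it fails
    inputs:
      command: restore
      restoreSolution: '**/*.sln'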
Original answer:
There is no way of doing that with native tasks. However, if you can script the step, then you can put such retry logic inside the script.
You could do it, for instance, in this way:
n=0
until [ "$n" -ge 5 ]          # at most 5 attempts
do
  command && break            # substitute your command here; stop on success
  n=$((n+1))
  sleep 15                    # wait before the next attempt
done
However, there is no native way of doing this for regular tasks.
Automatically retrying a task is on the roadmap, so this could change in the near future.

Argo Workflows semaphore with value 0

I'm learning about semaphores in Argo Workflows, to avoid concurrent workflows using the same resource.
My use case is that I have several external resources, each of which only one workflow can use at a time. So far so good, but sometimes a resource needs maintenance, and during that period I don't want Argo to start any workflows.
I guess I have two options:
1. I tested manually setting the semaphore value in the ConfigMap to 0, but Argo started one workflow anyway.
2. I can start a workflow that runs forever, until it is deleted, claiming the synchronization lock, but that adds the overhead of having workflows running that don't do anything.
So I wonder how it is supposed to work if I set the semaphore value to 0; I think Argo should not start any workflow then, since the limit says 0. Does anyone have any info about this?
These are the steps I carried out:
1. First I apply my ConfigMap with kubectl apply -f.
2. I then submit some workflows, and since they all use the same semaphore, Argo starts one and executes the rest in order, one at a time.
3. I then change the value of the semaphore with kubectl edit configmap.
4. I submit a new job, which Argo then executes.
Perhaps Argo does not reload the ConfigMap when I update it through kubectl edit? I would like to update the ConfigMap programmatically in the future, but used kubectl edit for testing for now.
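For context, the kind of setup described here looks roughly like this (the names, the semaphore key and the template body are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
data:
  external-resource: "1"          # the idea is to set this to "0" during maintenance
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: use-resource-
spec:
  entrypoint: main
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphore-config
        key: external-resource
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo using the external resource; sleep 30"]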
Quick fix: after applying the ConfigMap change, cycle the workflow-controller pod. That will force it to reload semaphore state.
I couldn't reproduce your exact issue. After using kubectl edit to set the semaphore to 0, any newly submitted workflows remained Pending.
I did encounter an issue where using kubectl edit to bump up the semaphore limit did not automatically kick off any of the Pending workflows. Cycling the workflow controller pod allowed the workflows to start running again.
Besides using the quick fix, I'd recommend submitting an issue. Synchronization is a newer feature, and it's possible it's not 100% robust yet.