RunDeck: retry failed job but only for those nodes which have failed

I need to run Ansible playbooks on a set of hosts (RunDeck nodes). Those nodes are often unreachable (IoT/home devices), so I would like a job that executes the following logic:
try to execute a specific playbook on a set of 100 nodes
retry only on the nodes where it failed (do not rerun on all 100 nodes again)
keep running until the playbook has executed successfully on all required nodes
Now: the retry option (https://docs.rundeck.com/docs/manual/creating-jobs.html#retry) seems to restart the whole job, so it's not what I want to achieve. Correct? Is there any way of achieving the above? (I run similar jobs on Apache Airflow, where I can easily retry only the failed tasks.)
Thanks,

Set your job to run in parallel on all nodes (edit your job, go to the Nodes tab, scroll down to the "Thread Count" field, and set it to the number of nodes), then when the execution fails on some nodes you can run the job again only on those failed nodes.
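If you want to automate that loop (run the job, wait, re-run only on the nodes that failed, repeat until none are left), a rough sketch against the Rundeck API is shown below. The server URL, job UUID, API version, and the assumption that the execution-info response includes a failedNodes list are mine, not part of the answer above, so check them against your Rundeck version.

# retry_failed_nodes.py - rough sketch, not Rundeck's built-in retry:
# run the job, wait, then re-run it only on the nodes that failed, until none are left.
# Assumed by me (verify for your setup): API version 40 paths, a token in the
# RD_TOKEN env var, and that the execution-info JSON contains a "failedNodes" list.
import os
import time

import requests

RUNDECK = "https://rundeck.example.com"   # hypothetical server URL
JOB_ID = "your-job-uuid"                  # hypothetical job UUID
HEADERS = {
    "X-Rundeck-Auth-Token": os.environ["RD_TOKEN"],
    "Accept": "application/json",
}

def run_job(node_filter=None):
    """Start the job (optionally restricted by a node filter) and return the execution id."""
    body = {"filter": node_filter} if node_filter else {}
    resp = requests.post(f"{RUNDECK}/api/40/job/{JOB_ID}/run", json=body, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["id"]

def wait_for(execution_id):
    """Poll the execution until it finishes and return its info."""
    while True:
        resp = requests.get(f"{RUNDECK}/api/40/execution/{execution_id}", headers=HEADERS)
        resp.raise_for_status()
        info = resp.json()
        if info["status"] != "running":
            return info
        time.sleep(30)

if __name__ == "__main__":
    info = wait_for(run_job())                      # first pass over all nodes
    failed = info.get("failedNodes", [])
    while failed:
        # name-based node filter, e.g. "name: node3,node17"
        info = wait_for(run_job("name: " + ",".join(failed)))
        failed = info.get("failedNodes", [])

Scheduling a script like this (for example from cron) approximates the "keep running until every node has succeeded" behavior asked for in the question.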

Related

Airflow tasks failing with SIGTERM when worker pods scale down

I am running an Airflow cluster on EKS on AWS. I have set up scaling config for the workers: if CPU/memory usage exceeds 70%, Airflow spins up a new worker pod. However, I am facing an issue when these worker pods scale down. When worker pods start scaling down, two things happen:
If no task is running on a worker pod, it terminates within 40 seconds.
If any task is running on a worker pod, it terminates in about 8 minutes, and after one more minute I find the task marked as failed in the UI.
I have set up the two properties below in the Helm chart for worker pod termination.
celery:
  ## if celery worker Pods are gracefully terminated
  ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
  ##   enough available workers during graceful termination waiting periods
  ##
  ## graceful termination process:
  ##  1. prevent worker accepting new tasks
  ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
  ##  3. send SIGTERM to worker
  ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
  ##  5. send SIGKILL to worker
  ##
  gracefullTermination: true

  ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
  ##
  gracefullTerminationPeriod: 180

## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
##   to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
From these settings I understand a worker pod should shut down after 5 minutes at most, whether a task is running or not, so I am not sure why I see a total of about 8 minutes for worker pod termination. My main issue: is there any way to set up the config so that a worker pod only terminates once the task running on it has finished? Since tasks in my DAGs can run anywhere from a few minutes to a few hours, I don't want to put a large value for gracefullTerminationPeriod. I would appreciate any solution around this.
Some more info: generally the long-running task is a PythonOperator that runs either a Presto SQL query or a Databricks job via PrestoHook or DatabricksOperator respectively, and I don't want these to receive SIGTERM before they complete their execution when the worker pod scales down.
This is not possible due to limitations on the Kubernetes side. More details are available here. However, using a large value for gracefullTerminationPeriod works; although this is not what I intended to do, it works better than I originally thought. When a large gracefullTerminationPeriod is set, workers don't actually wait around for the whole period before terminating: once a worker pod is marked for termination, it terminates as soon as the number of tasks running on it reaches zero.
Until Kubernetes accepts the proposed changes and a new community Helm chart is released, I think this is the best solution that doesn't incur the cost of keeping workers up.

Slurm: How to restart failed worker job

If one is running an array job on a Slurm cluster, how can one restart a failed worker job?
In a Sun Grid Engine queue, one can add #$ -r y to the job file to indicate the job should be restarted if it fails. What is the Slurm equivalent of this flag?
You can use --requeue
#SBATCH --requeue ### On failure, requeue for another try
--requeue
Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
See more here: https://slurm.schedmd.com/sbatch.html#lbAE

Can Rundeck Be Configured to Use Different Nodes If Someone Else is Already Running A Job On Current Node?

I am trying to see if Rundeck is capable of deciding that a node is currently busy running another job, and of switching to another node and running the job there instead.
For example, I am currently running a job on NODE1; then another person logs into Rundeck and decides to run their job on NODE1, but NODE1 is busy running my job, so Rundeck should automatically run their job on NODE2.
Thanks
This could be possible assuming the following design:
1) Create/use a resource model source plugin that sets attributes describing each node's busyness. This can be a metric like load status or something else you use to gauge Rundeck utilization (a rough sketch of a script-based source follows this list).
2) Write the job with a node filter that matches on that utilization attribute.
3) Define the job to use the Random Subset orchestrator strategy, specifying 1 node.
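For step 1, one lightweight approach is a script-based resource model source whose output Rundeck parses as resourcejson; a rough sketch in Python is below. The node list, SSH access, load-average metric, and threshold are placeholders, so treat this as an outline rather than a drop-in plugin.

# busy_nodes_source.py - rough sketch of step 1: a script resource model source whose
# stdout Rundeck parses as resourcejson. Node names, hostnames, the SSH-based
# load-average metric, and the "low"/"high" threshold are all placeholders.
import json
import subprocess

NODES = {
    "NODE1": "node1.example.com",
    "NODE2": "node2.example.com",
}

def load_average(host):
    """Placeholder busyness metric: 1-minute load average read over SSH."""
    out = subprocess.run(
        ["ssh", host, "cat", "/proc/loadavg"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.split()[0])

resources = {}
for name, host in NODES.items():
    resources[name] = {
        "nodename": name,
        "hostname": host,
        "username": "rundeck",
        # custom attribute the job's node filter (step 2) can match on
        "utilization": "low" if load_average(host) < 1.0 else "high",
    }

print(json.dumps(resources, indent=2))

The node filter in step 2 could then match on something like utilization: low.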

Gracefully update running celery pod in Kubernetes

I have a Kubernetes cluster running Django, Celery, RabbitMQ and Celery Beat. I have several periodic tasks spaced out throughout the day (so as to keep server load down). There are only a few hours when no tasks are running, and I want to limit my rolling updates to those times, without having to track it manually. So I'm looking for a solution that will allow me to fire off a script or task of some sort that will monitor the Celery server and trigger a rolling update once there's a window in which no tasks are actively running. There are two possible ways I thought of doing this, but I'm not sure which is best, nor how to implement either one.
Run a script (bash or otherwise) that checks up on the Celery server every few minutes and initiates the rolling update if the server is inactive (see the sketch after this question).
Increment the Celery app name before each update (in the Beat run command, the Celery run command, and in the celery.py config file), create a new Celery pod, rolling-update the Beat pod, and then delete the old Celery pod 12 hours later (a reasonable time span for all running tasks to finish).
Any thoughts would be greatly appreciated.
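A minimal sketch of option 1, assuming the Celery app is importable as proj.celery:app and the worker Deployment is named celery-worker (both hypothetical names), with kubectl available wherever the script runs:

# check_and_rollout.py - minimal sketch of option 1. The import path proj.celery
# and the Deployment name "celery-worker" are placeholders, not from the question.
import subprocess
import time

from proj.celery import app  # your Celery application instance

def tasks_running() -> bool:
    """Return True if any worker reports active or reserved (queued-on-worker) tasks."""
    inspect = app.control.inspect(timeout=5)
    active = inspect.active() or {}     # {worker_name: [task_info, ...]}
    reserved = inspect.reserved() or {}
    return any(active.values()) or any(reserved.values())

if __name__ == "__main__":
    while True:
        if not tasks_running():
            # idle window found: trigger a standard rolling update of the worker Deployment
            subprocess.run(
                ["kubectl", "rollout", "restart", "deployment/celery-worker"],
                check=True,
            )
            break
        time.sleep(300)  # check again in five minutes

Running this as a Kubernetes CronJob during the expected quiet hours would avoid keeping a long-lived watcher process around.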

how to update cluster status in dataproc

I changed my initialization script after creating a cluster with 2 worker nodes for Spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to run apt-get update before apt-get install, so Dataproc reported an error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes again, it no longer works, with the following message:
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes are still added, but they don't seem to be detected by the running Spark application at first, because no more executors are added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors were added. So it seems everything is working fine again, except that the cluster's status is still ERROR and I cannot increase or decrease the number of worker nodes anymore.
How can I update the cluster status back to normal (RUNNING)?
In your case, ERROR indicates that the workflow to reconfigure the cluster has failed and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed, so further updates are disallowed. You can, however, still submit jobs.
Your best bet is to delete it and start over.
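If you script the delete-and-recreate, a rough sketch using the gcloud CLI from Python is below; the project, region, worker count, and init-script path are placeholders for your own values.

# recreate_cluster.py - rough sketch of "delete it and start over" via the gcloud CLI.
# Project, region, worker count, and the init-script path are placeholders.
import subprocess

PROJECT = "my-project"
REGION = "us-central1"
CLUSTER = "cluster-1"
INIT_SCRIPT = "gs://my-bucket/init.sh"   # the corrected initialization action

# tear down the cluster that is stuck in ERROR
subprocess.run(
    ["gcloud", "dataproc", "clusters", "delete", CLUSTER,
     "--project", PROJECT, "--region", REGION, "--quiet"],
    check=True,
)

# recreate it with the fixed init script and the desired number of workers
subprocess.run(
    ["gcloud", "dataproc", "clusters", "create", CLUSTER,
     "--project", PROJECT, "--region", REGION,
     "--num-workers", "4",
     "--initialization-actions", INIT_SCRIPT],
    check=True,
)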