Jobs remain in a running state after Rundeck crashes/restarts - rundeck

Details:
App Version: Rundeck Community - v4.2.1 (though the issue below has
been around for some time)
Database: PostgreSQL 9.6
I'm wondering if anyone else has experienced this?
In the event of a Rundeck crash or server crash, jobs that were executing remain in a running state. Navigating to the execution shows the following message:
"Workflow State and Log Output is not available."
This causes issues, since a stuck running execution blocks subsequent executions, never fires alerts, and so on.
Is there a config setting that can be used to force any jobs in a RUNNING state to fail in the event of a Rundeck service crash/restart?
regards

Right now the way is to use the Missed Job Fires feature together with Job Resume (both features are available only on PagerDuty Process Automation On Prem, formerly "Rundeck Enterprise") to resume missed jobs/steps after a Rundeck server failure.
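As an unsupported stopgap on the community edition, stuck executions can also be flagged as failed directly in the database once the server is back up. Below is a minimal JDBC sketch; the table and column names (execution, status, date_completed) and the connection details are assumptions about Rundeck's PostgreSQL schema, so verify them against your own database, take a backup, and stop Rundeck before running anything like it.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Marks executions that a crashed Rundeck left "running" as failed so
    // they no longer block subsequent runs. Schema names are assumptions.
    public class FailStuckExecutions {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://localhost:5432/rundeck"; // hypothetical DSN
            try (Connection conn = DriverManager.getConnection(url, "rundeck", "secret");
                 PreparedStatement stmt = conn.prepareStatement(
                         "UPDATE execution SET status = 'failed', date_completed = now() "
                       + "WHERE date_completed IS NULL")) {
                int updated = stmt.executeUpdate();
                System.out.println("Marked " + updated + " stuck executions as failed");
            }
        }
    }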

Related

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted and began again from the beginning after 4 days of running (the databases are extremely large). The task had run uninterrupted until then and should have continued for another 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostic settings configured, so I could only look at the metrics, and there was a short period when there wasn't any node.
What could cause this behavior, i.e. the node restarting while the task is still running?
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (log in interactively) and check the system logs for details on why the node restarted. There is no guarantee that a compute node will have 100% uptime, as there may be hardware faults or other service interruptions.
In general, it's best practice to have long-running tasks checkpoint progress, combined with a retry policy. Programs that can reload state can pick up from the checkpoint when the Batch service automatically reschedules the task execution; a sketch follows below. Please see the Batch best practices guide for more information.
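To make that concrete, here is a minimal checkpoint/resume sketch (in Java, with hypothetical file and item names): the task records the index of the last completed unit of work, so a rescheduled run skips what is already done. On Batch the checkpoint should live on durable storage (for example a mounted file share or blob storage) rather than the node-local disk, since local disk contents are lost if the node is reimaged.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class CheckpointedTask {

        // In production, point this at durable storage, not the node's local disk.
        private static final Path CHECKPOINT = Paths.get("checkpoint.txt");

        public static void main(String[] args) throws IOException {
            List<String> workItems = List.of("db1", "db2", "db3"); // placeholder work
            for (int i = loadCheckpoint(); i < workItems.size(); i++) {
                process(workItems.get(i)); // the long-running unit of work
                saveCheckpoint(i + 1);     // persist progress after each unit
            }
            Files.deleteIfExists(CHECKPOINT); // clean up once everything is done
        }

        private static int loadCheckpoint() throws IOException {
            return Files.exists(CHECKPOINT)
                    ? Integer.parseInt(Files.readString(CHECKPOINT, StandardCharsets.UTF_8).trim())
                    : 0;
        }

        private static void saveCheckpoint(int next) throws IOException {
            Files.writeString(CHECKPOINT, Integer.toString(next), StandardCharsets.UTF_8);
        }

        private static void process(String item) {
            System.out.println("Processing " + item); // stand-in for the real work
        }
    }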

spring-batch job monitoring and restart

I am new to spring-batch and have a few questions:
I have a question about restart. As per the documentation, the restart feature is enabled by default. What I am not clear on is whether I need to write any extra code for a restart. If so, I am thinking of adding a scheduled job that looks for failed executions and restarts them?
I understand spring-batch-admin is deprecated. However, we cannot use spring-cloud-data-flow right now. Is there any other alternative to monitor and restart jobs on demand?
The restart feature you mention only controls whether a job is restartable or not. It doesn't mean Spring Batch will restart a failed job for you automatically.
Instead, it provides the following building blocks for developers to achieve this on their own:
JobExplorer to find the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also, a restartable job can only be restarted if its status is FAILED. So if you want to restart a running job that stopped because of a server breakdown, you first have to find that running job and update its job execution status, and the status of all of its step executions, to FAILED. (See this for more information.) One solution is to implement a SmartLifecycle that uses the above building blocks to achieve this goal, as in the sketch below.
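Here is a minimal sketch of that approach using the Spring Batch 4 APIs mentioned above (the surrounding class and method names are hypothetical): it flags a stuck execution and its steps as FAILED via JobRepository, then restarts it via JobOperator.

    import java.util.Date;
    import org.springframework.batch.core.BatchStatus;
    import org.springframework.batch.core.ExitStatus;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.core.launch.JobOperator;
    import org.springframework.batch.core.repository.JobRepository;

    public class StuckJobRestarter {

        private final JobExplorer jobExplorer;
        private final JobRepository jobRepository;
        private final JobOperator jobOperator;

        public StuckJobRestarter(JobExplorer jobExplorer, JobRepository jobRepository,
                                 JobOperator jobOperator) {
            this.jobExplorer = jobExplorer;
            this.jobRepository = jobRepository;
            this.jobOperator = jobOperator;
        }

        // Find executions left in a running state, mark them FAILED, restart them.
        public void failAndRestart(String jobName) throws Exception {
            for (JobExecution execution : jobExplorer.findRunningJobExecutions(jobName)) {
                for (StepExecution step : execution.getStepExecutions()) {
                    if (step.getStatus().isRunning()) {
                        step.setStatus(BatchStatus.FAILED);
                        step.setEndTime(new Date());
                        jobRepository.update(step); // persist the step status change
                    }
                }
                execution.setStatus(BatchStatus.FAILED);
                execution.setExitStatus(ExitStatus.FAILED);
                execution.setEndTime(new Date());
                jobRepository.update(execution); // the job is now restartable

                // Restart resumes from the last committed chunk of the failed steps.
                jobOperator.restart(execution.getId());
            }
        }
    }

Calling failAndRestart from a SmartLifecycle's start() method, as suggested above, would run this cleanup automatically when the server comes back up.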

Rundeck loses track of job

I'm having an issue with Rundeck.
I'm running Rundeck version 2.10.4 with the default file-based data storage.
For one particular environment, if the job takes longer than about 1 hour (an approximate observation), the job stays at "Running" forever.
After that roughly 1-hour period, the job's logs are no longer recovered into Rundeck's job log output.
When the job eventually ends on the remote server, I am forced to kill the job manually in Rundeck.
I can't seem to find anything relevant in Rundeck's log files.
The jobs I'm running are on remote environments.
It doesn't happen on every environment. The one I'm having issues with is in a different network zone (I don't know if that matters).
I was having the same situation with my former 2.6.1 version of Rundeck.
any clue?
thanks

How do I upgrade Concourse from 3.4.0 to 3.5.0 without causing jobs to abort with a state error?

When I upgraded Concourse from 3.4.0 to 3.5.0, all running jobs suddenly changed their state from running to errored. I can now see the string 'no workers' at the start of their logs. Re-running the jobs, whether started manually or triggered by the next change, worked without any problem.
The upgrade of concourse itself was successful.
I was watching what BOSH did at the time, and I saw that this change of job states took place all at once while either the web or the db VM was being upgraded (I don't know which one). I am pretty sure that the worker VMs had not yet been touched by BOSH.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM it's possible that it was out of service long enough that all workers expired. Workers continuously heartbeat, and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy finishes, but if scheduling happened before they heartbeated again, that would cause those errors.

A Workload Scheduler step doesn't proceed and remains in queued status

I created a simple process with the Application Lab interface in Bluemix Workload Scheduler. I ran my process, but the step didn't proceed and remained in queued status.
How can I get the step to proceed?
I executed the process with "Run now". The process doesn't have triggers.
The step remains in "Queued" status.
The process information:
There is only one step. The step is "ping www.ibm.com".
The process doesn't have a trigger; it is an on-demand process.
There might be a problem with the agent, as I can successfully run a simple workload process without any issues. If you are using the Workload Automation Agent that is created for you, then you will need to open a support ticket to have the Workload team look at that agent.
Reviewing your question, I think that a process submitted to the Workload Scheduler service should be one that completes: a ping command like the one you are trying to submit will never complete unless it is 'killed' with CTRL+C or invoked with the [-c count] option (for example, ping -c 4 www.ibm.com).