Incorrect failure notifications from Rundeck during the fall time change

Last night was the "fall back" time change for most locations in the US. I woke up this morning to find dozens of job failure notifications. Almost all of them, though, were incorrect: the jobs showed as having completed normally, yet Rundeck sent failure notifications for them.
Interestingly, this happened in two completely separate Rundeck installations (v2.10.8-1 and v3.1.2-20190927). The commonality is that they're both on CentOS 7 (separate servers). They're both using MariaDB, although different versions of MariaDB.
The failure emails for the jobs that finished successfully showed a negative time in the "Scheduled after" line:
#1,811,391
by admin Scheduled after 59m at 1:19 AM
• Scheduled after -33s - View Output »
• Download Output
Execution
User: admin
Time: 59m
Started: in 59m 2019-11-03 01:19:01.0
Finished: 1s ago Sun Nov 03 01:19:28 EDT 2019
Executions: Success rate 100%, Average duration -45s
That job actually ran in 27s at 01:19 EDT (the first 1am hour, it is now EST). Looking at the email headers, I believe I got the message at 1:19 EST, an hour after the job ran.
So that would seem to imply to me that it's just a notification problem (somehow).
But a couple of jobs that run after other job executions failed as well, apparently because the successfully finished upstream job was reported as returning RC 2. I'm not sure what to make of that.
We've been running Rundeck for a few years now, and this is the first time I remember seeing this problem. Of course, my memory may be faulty--maybe we did see it previously and fewer jobs were affected, or some such.
The fact that it impacted two different versions of Rundeck on two different servers implies that either it's a fundamental issue in Rundeck that's been around for a while, or something else in the operating system is somehow causing problems for Rundeck. (Although the time change isn't new either, so that would be somewhat surprising too.)
Any thoughts about what might have gone on (and how to prevent it next year, short of the obvious fix of running on UTC) would be appreciated.

You can define a specific time zone in Rundeck; check this and this.
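This is not a fix, but a minimal Python sketch (nothing Rundeck-specific; the date and zone are just the ones from the report) of why the repeated 1 AM hour can produce durations that are an hour off or negative: the same local timestamp maps to two different instants during the fall-back transition, and arithmetic done on wall-clock times never sees the extra hour.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo("America/New_York")

# 2019-11-03 01:19:01 happens twice: first in EDT (UTC-4), then again in EST (UTC-5).
first = datetime(2019, 11, 3, 1, 19, 1, tzinfo=tz)           # fold=0 -> the EDT occurrence
second = datetime(2019, 11, 3, 1, 19, 1, fold=1, tzinfo=tz)  # fold=1 -> the EST occurrence

print(first.utcoffset(), second.utcoffset())  # UTC-4 vs UTC-5 for the same wall-clock time
print(second - first)                         # 0:00:00 -- wall-clock arithmetic sees no time passing
print(second.astimezone(timezone.utc) - first.astimezone(timezone.utc))  # 1:00:00 of real elapsed time

If one component records the first 01:19 and another interprets it as the second one (or compares it against a clock that has already fallen back), values like "Scheduled after -33s" and a negative average duration are exactly the kind of numbers that fall out. Scheduling against a fixed zone such as UTC avoids the ambiguity.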

Related

Airflow example DAGs take a long time in GUI and the scheduler is showing this: DagFileProcessorManager (PID=...) last sent a heartbeat

While running the standard Airflow examples with Airflow 2.1.2, DAGs are taking a long time to complete. This problem occurs on every DAG run. It happens when running from the Airflow GUI; it isn't a problem when running as a test from the Airflow command line. Looking at the scheduler log as it runs, this is what is apparent: after a task runs, the DagFileProcessorManager apparently has to be restarted before the run continues to the next tasks, which takes 1 to 2 minutes. The restart happens after an absence of heartbeat responses, and this error shows:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?
This fixed the problem:
(1) Use postgresql instead of sqlite.
(2) Switch from SequentialExecutor to LocalExecutor.
Just to add to that - we had other similar reports, and we decided to show a very clear warning for this case in the UI (it will be released in the next version):
https://github.com/apache/airflow/pull/17133
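To double-check that the change actually took effect, here is a small, hedged sketch that reads the live configuration; it assumes Airflow 2.1.x, where sql_alchemy_conn still lives in the [core] section (it moved to [database] in later releases):

from airflow.configuration import conf

# Should print LocalExecutor (or CeleryExecutor/KubernetesExecutor), not SequentialExecutor,
# and a postgresql:// connection string rather than sqlite:///...
print(conf.get("core", "executor"))
print(conf.get("core", "sql_alchemy_conn"))
# CLI equivalent: airflow config get-value core executor

The SequentialExecutor plus SQLite combination runs one task at a time and is only intended for local experiments, which is why switching the database backend and executor makes such a difference.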

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted and began from the beginning after 4 days of running (extremely large databases). The task had run uninterrupted up until then and should have continued for about another 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during execution. Also, the retry count for the task was set to 0, so it would not be retried if it failed.
Unfortunately, I did not have diagnostic settings configured, so I could only look at the metrics, and there was a short period when there wasn't any node.
What could be the causes of this behavior--restarting while the node is still running?
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (log in interactively) and check the system logs for details on why the node restarted. There is no guarantee that a compute node will have 100% uptime, as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
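As a hedged illustration of the checkpoint idea (generic Python rather than the actual PowerShell/sqlpackage.exe task; the file name, database list, and helper function are made up), the task records each completed unit of work so that a rescheduled run skips what is already done:

import json, os

CHECKPOINT = "checkpoint.json"      # hypothetical progress marker
DATABASES = ["db1", "db2", "db3"]   # hypothetical list of databases to move

def export_and_import(db):
    # placeholder for the real sqlpackage.exe export/import of one database
    print("processing", db)

done = set()
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        done = set(json.load(f))

for db in DATABASES:
    if db in done:
        continue                     # completed in a previous run, skip after a reschedule
    export_and_import(db)
    done.add(db)
    with open(CHECKPOINT, "w") as f: # persist progress after each completed database
        json.dump(sorted(done), f)

Since the node itself can be replaced, the checkpoint should live on durable storage the task can reach again (for example a storage account), not only on the node's local disk.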

rundeck loses track of Job

I'm having an issue with Rundeck.
I'm running Rundeck version 2.10.4 with the default file-based data storage.
For one particular environment, if the job takes longer than about 1h (approximate observation), the job stays at "Running" forever.
After that roughly 1h period, the job's logs are no longer picked up in Rundeck's job log output.
When the job eventually ends on the remote server, I am forced to kill the job manually in Rundeck.
I can't seem to find anything relevant in Rundeck's log files.
The jobs I'm running are on remote environments.
It doesn't happen on every environment. The one I'm having issues with is in a different network zone (I don't know if that matters).
I was having the same situation with my previous Rundeck 2.6.1 installation.
Any clue?
Thanks

Chronos + Mesosphere. How to execute tasks in parallel?

Good day everyone.
I have a single server for Chronos, Mesos, and ZooKeeper, and I want to use Chronos as something that will run my scripts daily. Some scripts today, some tomorrow, and so on.
The problem is that when I try to launch tasks one right after another, only the first one executes correctly; the other one is lost somewhere. If I launch the first one, then pause for 3-4 seconds and launch the other, they both run, but sequentially.
And I need to run them in parallel.
Can someone provide a hint on this? Maybe there are some settings that I must change?
You should set a start time in UTC for both tasks to be launched, with a repeating period of 24 hours. In that case, there is no reason why your tasks should not execute in parallel. Check the Chronos logs and the task logs in the sandbox on Mesos for errors.
You can certainly run all of these components (Chronos, master, slave, and ZK) on the same machine, although ZK really becomes valuable once you have HA with multiple masters.
As user4103259 suggested, check the master and slave logs for that LOST/failed taskId to see what exactly happened to it. A task could go LOST/failed for numerous reasons, anywhere along the task launch/running/completing process.
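For reference, a hedged sketch of registering two jobs with the same start instant so they come due at the same time, assuming the classic Chronos 2.x HTTP API where scheduled jobs are created by POSTing JSON to /scheduler/iso8601 (the host, port, names, and commands below are placeholders):

import requests

CHRONOS = "http://localhost:4400"  # placeholder Chronos endpoint

for name, cmd in [("job-a", "/opt/scripts/a.sh"), ("job-b", "/opt/scripts/b.sh")]:
    job = {
        "name": name,
        "command": cmd,
        "owner": "ops@example.com",
        # Same UTC start instant and a 24-hour repeat for both jobs.
        "schedule": "R/2015-06-01T17:00:00Z/P1D",
    }
    resp = requests.post(CHRONOS + "/scheduler/iso8601", json=job)
    resp.raise_for_status()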

Task Scheduler 6 days a week

I have a backup that I run as a scheduled task every night, then clear out every weekday morning after I check it.
The problem is that my server's memory is almost maxed out (I can't upgrade with my current host, and I'm still researching others), and the weekend backups are leaving me with no memory by Monday morning.
Is there a way to have Windows Task Scheduler run 6 days a week instead of 7?
If you select Weekly, you can select which days your task should run.
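For reference, the equivalent from the command line (the task name, program, and start time here are placeholders):

schtasks /Create /TN "NightlyBackup" /TR "C:\scripts\backup.cmd" /SC WEEKLY /D MON,TUE,WED,THU,FRI,SAT /ST 23:00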