How to dynamically schedule tasks based on success and failure? - scheduled-tasks

I have a Python program that I need to run once a day; however, the program is prone to many different kinds of errors (no internet connection, etc.).
I need a way to rerun the program if it fails, and to keep retrying it with incrementally longer delays after each failure.
For ex:
Failed
Rerun 5m after - Failed
Rerun 10m after - Failed
Rerun 20m after - Failed
etc... (until success)
Also, if the computer was off, I need a way for it to know whether the program has already run successfully that day, and to run it if it hasn't.
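A minimal sketch of one way to do this in Python, assuming a hypothetical run_job() entry point and a marker file recording the last successful run; the script would be launched once a day (cron / Task Scheduler) and also at boot, which covers the computer-was-off case:

    import time
    from datetime import date
    from pathlib import Path

    MARKER = Path.home() / ".daily_job_last_success"   # assumed marker-file location

    def run_job():
        """Hypothetical entry point for the real program; may raise (no internet, etc.)."""
        ...

    def already_ran_today():
        return MARKER.exists() and MARKER.read_text().strip() == date.today().isoformat()

    def run_with_backoff(initial_delay=5 * 60, factor=2):
        delay = initial_delay
        while True:
            try:
                run_job()
            except Exception as exc:                      # any failure triggers another attempt
                print(f"Run failed ({exc}); retrying in {delay // 60} minutes")
                time.sleep(delay)
                delay *= factor                           # 5 min -> 10 min -> 20 min -> ...
            else:
                MARKER.write_text(date.today().isoformat())  # record today's success
                return

    if __name__ == "__main__":
        # Only run the job if it hasn't already succeeded today.
        if not already_ran_today():
            run_with_backoff()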

Related

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted from the beginning after 4 days of running (the databases are extremely large). The task had run uninterrupted up until then and should have continued for about 1-2 more days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during execution. Also, the task's retry count was set to 0, so it should not be retried when it fails.
Unfortunately, I did not have diagnostic settings configured, so I could only look at the metrics, which show a short period during which there wasn't any node.
What could cause this behavior, i.e. the task restarting while the node is still running?
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check the system logs for details on why the node restarted. There is no guarantee that a compute node will have 100% uptime, as there may be hardware faults or other service interruptions.
In general, it's best practice to have long-running tasks checkpoint their progress, combined with a retry policy. Programs that can reload state can pick up at the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
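Not Batch-specific, but as a rough illustration of the checkpoint-and-resume idea, here is a sketch in Python where checkpoint.json and process_item() are hypothetical placeholders:

    import json
    from pathlib import Path

    CHECKPOINT = Path("checkpoint.json")   # assumed checkpoint file on durable storage

    def process_item(item):
        """Hypothetical unit of work (e.g. one database or one batch of rows)."""
        ...

    def run(items):
        # Resume from the last recorded position if the task was rescheduled.
        start = json.loads(CHECKPOINT.read_text())["next"] if CHECKPOINT.exists() else 0
        for i in range(start, len(items)):
            process_item(items[i])
            # Persist progress after each unit so a retried task picks up here
            # instead of starting over from the beginning.
            CHECKPOINT.write_text(json.dumps({"next": i + 1}))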

Airflow long running job killed after 1 hr but the task is still in running state

I need help with a long-running DAG that keeps failing after an hour while the task remains in the running state.
I have been using Airflow for the past 6-8 months. With the help of our infrastructure team, I set up Airflow in our company. It runs on an AWS ECS cluster. The DAGs sit on an EFS instance with throughput set to provisioned, and the logs are written to an S3 bucket.
For the worker AWS ECS service we have an autoscaling policy that scales the cluster up at 1 AM and down at 4 AM.
It runs fine for short-duration jobs. It was also successful with a long-duration job that intermittently wrote its results into a Redshift table.
But now I have a job that loops over a pandas DataFrame and updates two dictionaries.
Issue:
The job takes about 4 hours to finish, but at around the 1-hour mark it fails without any error. The task stays in the running state until I manually stop it. And when I try to look at the logs, the actual log doesn't come up; instead it shows:
[2021-05-04 19:59:18,785] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: app-doctor-utilisation.execute 2021-05-04T18:57:10.480384+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2021-05-04 19:59:18,786] {local_task_job.py:90} INFO - Task is not able to be run
Now when I stop the task, I can see some of the logs, with the following lines at the end:
[2021-05-04 20:11:11,785] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 38
[2021-05-04 20:11:11,787] {taskinstance.py:955} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-05-04 20:11:11,959] {helpers.py:291} INFO - Process psutil.Process(pid=38, status='terminated', exitcode=0, started='18:59:13') (38) terminated with exit code 0
[2021-05-04 20:11:11,960] {local_task_job.py:102} INFO - Task exited with return code 0
Can someone please help me figure out the issue and whether there is a solution for this?

spring-batch job monitoring and restart

I am new to Spring Batch and have a couple of questions:
I have a question about restart. As per the documentation, the restart feature is enabled by default. What I am not clear on is whether I need to write any extra code for a restart. If so, I am thinking of adding a scheduled job that looks for failed executions and restarts them?
I understand spring-batch-admin is deprecated. However, we cannot use spring-cloud-data-flow right now. Is there any other alternative for monitoring and restarting jobs on demand?
The restart feature you mention only indicates whether a job is restartable or not. It doesn't mean Spring Batch will restart a failed job for you automatically.
Instead, it provides the following building blocks so that developers can achieve this on their own:
JobExplorer to find out the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also, a restartable job can only be restarted if its status is FAILED. So if you want to restart a running job that stopped because of a server breakdown, you first have to find that running job execution and update its status, and all of its step execution statuses, to FAILED in order to restart it. (See this for more information.) One solution is to implement a SmartLifecycle that uses the above building blocks to achieve this goal.

Datastage: How to keep a continuous mode job running after an unexpected termination

I have a job that uses the Kafka Connector stage to read a Kafka queue and then load the data into the database. That job runs in continuous mode, so it never finishes, since it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues, etc.) that job may terminate with a failure; in general, that happens after about 300 hours of running. So, to keep the job alive, I have to manually check the job status and then do a Reset and Run.
The problem is that several hours can pass between the job's termination and my manual Reset and Run, which is critical. So I'm looking for a way to eliminate the manual interaction and reduce that gap by automating the job invocation.
I tried using Control-M to run the job daily, but with no success: the first day Control-M called the job, it ran fine. But the next day, when Control-M attempted to instantiate the job again, it failed (since the job was already running). Besides, DataStage will never report back to Control-M that the job concluded successfully, since the job's nature doesn't allow that.
That said, I would like to hear any ideas that could enlighten me.
The first thing that came to mind is to create an intermediate sequence and schedule it in Control-M. This new sequence would then call the continuous job asynchronously by using a Command Line stage.
For the case where just this one job terminates unexpectedly and you want it to be restarted as soon as possible, have you considered calling this job from a sequence? The sequence could be set up to run this job in a loop.
The sequence starts the job and waits for it to finish. When the job finishes, the sequence loops and starts the job again. You can also add conditions on the job's exit status (for example, if the job aborted, you could reset it before re-running it).
This would not handle the case where the DataStage engine itself was shut down (for maintenance, or possibly due to an error), in which case all jobs end, including your new sequence. The same applies to a server reboot or other situations where someone may have inadvertently stopped your sequence. For those cases (such as a DataStage engine stop), your team would need to have a process in place for jobs/sequences that need to be started up following a DataStage or system outage.
For the outage scenario, you could create a monitor script (regardless of whether you run the job solo or from a sequence) that sleeps/loops on a 5-10 minute interval, checks the status of your job using the dsjob command, and, if it is not running, starts that job/sequence (again via dsjob), as sketched below. You can decide whether that script starts at DataStage startup, at machine startup, or is run from Control-M or another scheduler.
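A rough sketch of such a monitor loop, written here in Python and shelling out to dsjob; the project and job names are placeholders, and the exact dsjob options shown (-jobinfo, -run -mode RESET) and status text should be verified against your DataStage version:

    import subprocess
    import time

    PROJECT = "MyProject"          # placeholder DataStage project name
    JOB = "KafkaContinuousJob"     # placeholder job or sequence name
    INTERVAL = 10 * 60             # check every 10 minutes

    def job_is_running():
        # "dsjob -jobinfo" prints the current job status; the exact output text
        # may vary by version, so check what your installation reports.
        out = subprocess.run(["dsjob", "-jobinfo", PROJECT, JOB],
                             capture_output=True, text=True).stdout
        return "RUNNING" in out.upper()

    def reset_and_run():
        # Equivalent of a manual "Reset and Run": reset the aborted job, then start it.
        subprocess.run(["dsjob", "-run", "-mode", "RESET", PROJECT, JOB], check=False)
        subprocess.run(["dsjob", "-run", PROJECT, JOB], check=False)

    if __name__ == "__main__":
        while True:
            if not job_is_running():
                reset_and_run()
            time.sleep(INTERVAL)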

SSIS Transfer Objects task fails when run from Agent

I am using the SSIS Transfer Objects task to transfer a database from one server to another. This is a nightly task as the final part of ETL.
If I run the task manually during the day, there is no problem. It completes in around 60 to 90 minutes.
When I run the task from the Agent, it always starts but often fails. I have the Agent steps set up to retry on failure, but most nights it takes 3 attempts, and on some nights 5 or 6.
The error message returned is twofold (both error messages show in the log for the same row):
1) An error occurred while transferring data. See the inner exception for details.
2) Timeout expired: The timeout period elapsed prior to completion of the operation or the server is not responding
I can't find any timeout limit to adjust that I haven't already adjusted.
Anyone have any ideas?