How do I view the log of a previous failure for an OpenShift Job? - kubernetes

I have a Job that runs daily, with a restartPolicy of OnFailure. It failed once, then ran again about 40 minutes later. We need to know the reason for the failure, but oc logs only brings up the log of the successful run.
Is there a way to bring up the log for the previous failed run?

Related

Airflow long running job killed after 1 hr but the task is still in running state

I need help with a long-running DAG that keeps failing after an hour while the task still shows as running.
I have been using Airflow for the past 6-8 months. With the help of our infrastructure team, I set up Airflow in our company. It’s running on an AWS ECS cluster. The DAGs sit in an EFS instance with throughput set to provisioned. The logs are written to an S3 bucket.
For the worker AWS ECS service we have an autoscaling policy that scales up the cluster at 1 AM and scales it down at 4 AM.
It’s running fine for short duration jobs. It also was successful with a long duration job that was writing the results into a redshift table intermittently.
But now I have a job that is looping over a pandas dataframe and updating two dictionaries.
Issue:
It takes about 4 hrs for the job to finish, but at around 1 hr it automatically fails without any error. The task stays in the running state until I manually stop it. And when I try to look at the logs, the actual log doesn’t come up. It shows:
[2021-05-04 19:59:18,785] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: app-doctor-utilisation.execute 2021-05-04T18:57:10.480384+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2021-05-04 19:59:18,786] {local_task_job.py:90} INFO - Task is not able to be run
Now when I stop the task, I can see some of the logs, ending with the following:
[2021-05-04 20:11:11,785] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 38
[2021-05-04 20:11:11,787] {taskinstance.py:955} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-05-04 20:11:11,959] {helpers.py:291} INFO - Process psutil.Process(pid=38, status='terminated', exitcode=0, started='18:59:13') (38) terminated with exit code 0
[2021-05-04 20:11:11,960] {local_task_job.py:102} INFO - Task exited with return code 0
Can someone please help me figure out the issue and if there is any solution for this?

spring-batch job monitoring and restart

I am new to spring-batch and have a few questions:
I have a question about restart. As per the documentation, the restart feature is enabled by default. What I am not clear on is whether I need to write any extra code for a restart. If so, I am thinking of adding a scheduled job that looks at failed processes and restarts them.
I understand spring-batch-admin is deprecated. However, we cannot use spring-cloud-data-flow right now. Is there any other alternative to monitor and restart jobs on demand?
The restart feature that you mention only determines whether a job is restartable or not. It doesn't mean Spring Batch will restart a failed job for you automatically.
Instead, it provides the following building blocks for developers to achieve this on their own:
JobExplorer to find out the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also, a restartable job can only be restarted if its status is FAILED. So if you want to restart a job that stopped because of a server breakdown but is still marked as running, you first have to find that job execution and update its status, and the status of all of its step executions, to FAILED in order to restart it. (See this for more information.) One solution is to implement a SmartLifecycle that uses the above building blocks to achieve this goal.

What does shutdown look like for a kubernetes cron job pod when it's being terminated by "replace" concurrency policy?

I couldn't find anything in the official kubernetes docs about this. What's the actual low-level process for replacing a long-running cron job? I'd like to understand this so my application can handle it properly.
Is it a clean SIGHUP/SIGTERM signal that gets sent to the app that's running?
Is there a waiting period after that signal gets sent, so the app has time to cleanup/shutdown before it potentially gets killed? If so, what's that timeout in seconds? Or does it wait forever?
For reference, here's the Replace policy explanation in the docs:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/
Concurrency Policy
Replace: If it is time for a new job run and the previous job run hasn’t finished yet, the cron job replaces the currently running job run with a new job run
A CronJob creates Jobs, and each Job just has a Pod underneath.
When a CronJob has a concurrency policy of "Replace" and the previous Job is still active when the next run is due, that Job will be deleted, which also deletes the Pod.
When a Pod is deleted the Linux container/s will be sent a SIGTERM and then a SIGKILL after a grace period, defaulted to 30 seconds. The terminationGracePeriodSeconds property in a PodSpec can be set to override that default.
Due to the flag added to the DeleteJob call, it sounds like this delete only removes values from the kube key/value store, which would mean the new Job/Pod could be created while the current Job/Pod is still terminating. You could confirm with a Job that doesn't respect SIGTERM and has terminationGracePeriodSeconds set to a few times your cluster's scheduling speed.
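For the application side of the question, here is a minimal Python sketch of handling that SIGTERM so cleanup can finish inside the grace period. The names run, work_step and cleanup are hypothetical placeholders for your own code, not anything Kubernetes provides, and the sketch assumes the app runs as the container's main process so that it actually receives the signal.

```python
import signal
import threading

# Set when SIGTERM arrives (for example, when the Pod is being deleted).
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the shutdown request; the main loop does the cleanup."""
    stop_requested.set()


def run(work_step, cleanup, poll_seconds=1.0):
    """Call work_step repeatedly until SIGTERM arrives, then run cleanup once.

    work_step and cleanup are hypothetical callables supplied by the app.
    Returns the number of work steps that were executed.
    """
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)
    steps = 0
    while not stop_requested.is_set():
        work_step()
        steps += 1
        # wait() returns early once the event is set, so shutdown is prompt.
        stop_requested.wait(poll_seconds)
    cleanup()
    return steps
```

Each work_step plus cleanup has to fit inside terminationGracePeriodSeconds (30 seconds unless overridden), otherwise the SIGKILL arrives before cleanup completes.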

Restarting a Spring batch job after app server failure or spring batch repository DB failure?

When a Spring Batch DB failure happens or the server is shut down, a Spring Batch job that was running at that time will be left in an unknown started state.
In spring batch admin, we will not see an option to restart the job. Hence we are not able to resume the job.
How to restart the job from last successful commit?
The old discussions suggest that it had to be dealt with manually by updating tables. I was able to manually update the end time and status in the batch step execution and batch job execution tables. Is it really the best option? It may not be practical to do that manually in a prod region.
As mentioned in the Aborting a Job section of the reference documentation, when a server failure happens, the job repository has no way to know that the process running the job died. Hence manual intervention is required.
How to restart the job from last successful commit?
Change the status of the job to FAILED and restart the job instance, it should continue from where it left off.

How to run kubernetes pod for a set period of time each day?

I'm looking for a way to deploy a pod on kubernetes to run for a few hours each day. Essentially I want it to run every morning at 8AM and continue running until about 5:30 PM.
I've been researching a lot and haven't found a way to deploy the pod with a specific timeframe in mind. I've found cron jobs, but they seem to be for pods that terminate themselves, whereas mine should be running constantly.
Is there any way to deploy my pod on kubernetes this way? Or should I just set up the pod itself to run its intended application based on its internal clock?
According to the Kubernetes architecture, a Job creates one or more pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the job tracks the successful completions. When a specified number of successful completions is reached, the job itself is complete.
In simple words, Jobs run until completion or failure. That's why there is no option to schedule a Cron Job termination in Kubernetes.
In your case, you can start a Cron Job regularly and terminate it using one of the following options:
A better way is for the container to terminate by itself, so you can add such functionality to your application or use Cron. You can find more information about how to add Cron to the Docker container here.
You can use another Cron Job to terminate your Cron Job. You need to run a command inside a Pod to find and delete the Pod related to your Job. For more information, you can look through this link. But it is not a good way, because your Cron Job will always have a failed status.
In both cases, you need to check with what status your Cron Job was finished and use the correct RestartPolicy accordingly.
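As an illustration of the first option (the container terminating by itself), here is a minimal Python sketch of an application loop that stops once a cutoff time has passed. The function names and the CUTOFF constant are hypothetical, the 17:30 value is taken from the question, and the sketch assumes the container's local clock is in the timezone you care about (containers often run in UTC).

```python
import time
from datetime import datetime, time as clock_time

# Assumed stop time, taken from the "5:30 PM" in the question.
CUTOFF = clock_time(17, 30)


def seconds_until(cutoff, now):
    """Seconds from `now` until today's `cutoff`; 0.0 if it has already passed."""
    target = datetime.combine(now.date(), cutoff)
    return max(0.0, (target - now).total_seconds())


def run_until_cutoff(work_step, cutoff=CUTOFF, now_fn=datetime.now,
                     sleep_fn=time.sleep, poll_seconds=60.0):
    """Call work_step repeatedly until the cutoff, then return the step count.

    now_fn and sleep_fn are injectable only so the loop can be tested
    without waiting for real time to pass.
    """
    steps = 0
    while seconds_until(cutoff, now_fn()) > 0:
        work_step()
        steps += 1
        # Never sleep past the cutoff.
        sleep_fn(min(poll_seconds, seconds_until(cutoff, now_fn())))
    return steps
```

If the program simply returns after run_until_cutoff, the process exits with code 0, so a Cron Job started at 8 AM would be recorded as completed rather than failed.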
It seems you can implement this using a CronJob object:
[ https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#creating-a-cron-job ]