Spring Batch job should wait for Prometheus to scrape before shutdown - spring-batch

We have Spring Batch jobs running with Prometheus as our monitoring system.
While a job is running, metrics are collected just fine, but when the job finishes, it shuts down before Prometheus manages to collect the "spring.batch.job" metric. That metric is crucial because it carries the 'duration' and 'status' tags, which indicate whether the job succeeded or failed.
How can I program the job to wait for 'one last scrape' after it is done, before it shuts down?

You can define what action to take before shutdown using a configuration property in your application.yml:
management.metrics.export.prometheus.pushgateway.shutdown-operation
The possible options are listed in org.springframework.boot.actuate.metrics.export.prometheus.PrometheusPushGatewayManager$ShutdownOperation.
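
For example, a minimal application.yml sketch, assuming Spring Boot 2.x property names and the io.prometheus:simpleclient_pushgateway dependency on the classpath; the base URL is a placeholder, and the valid enum values (NONE, PUSH/POST, DELETE, depending on your Boot version) should be checked against the ShutdownOperation enum you actually have:

management:
  metrics:
    export:
      prometheus:
        pushgateway:
          enabled: true
          base-url: http://pushgateway.example.com:9091   # hypothetical Pushgateway address
          shutdown-operation: PUSH   # push metrics one last time when the context closes

With this in place, the job no longer needs to stay alive for a final scrape: the pull model is replaced by one last push to the Pushgateway on shutdown, which Prometheus then scrapes.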

Related

Spring Batch task is getting shut down before the RSocket proxy scrapes metrics from it

I am using Spring Cloud Data Flow to launch batch jobs that are wrapped in Spring Cloud Task, and we have also configured Prometheus and the RSocket proxy to scrape task-related metrics. To make the tasks shut down, I am using spring.cloud.task.closecontext_enabled=true. But we observed that a few metrics are not scraped because of this closecontext flag (if it is set to false, metrics are scraped as intended, but the pods do not shut down, which increases TCO).
I suspect that the RSocket proxy is unable to get the metrics before the tasks shut down. Does anyone know about this issue and can give some pointers toward solving it?
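
For reference, a minimal application.yml sketch of the flag in question, spelled exactly as in the question; true gives prompt shutdown (and the scraping race described above), while false keeps the context and pod alive:

spring:
  cloud:
    task:
      closecontext_enabled: true   # close the application context when the task ends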

Apache Flink job is not scheduled on multiple TaskManagers in Kubernetes (replicas)

I have a simple Flink job that reads from an ActiveMQ source and sinks to a database and a print sink. I deploy the job on Kubernetes with 2 TaskManagers, each having 10 task slots (taskmanager.numberOfTaskSlots: 10). I configured the job's parallelism (10 in this case), which is no more than the total task slots available.
When I look at the Flink Dashboard, I see this job running in only one of the TaskManagers, while the other TaskManager has no jobs. I verified this by checking where every operator is scheduled; also, on the Task Manager UI page, one of the managers has all slots free. I attach images below for reference.
Did I configure anything wrong? Where is the gap in my understanding? Can someone explain it?
The first task manager has enough slots (10) to fully satisfy the requirements of your job.
The scheduler's default behavior is to fully utilize one task manager's slots before using slots from another task manager. If instead you would prefer that Flink spread out the workload across all available task managers, set cluster.evenly-spread-out-slots: true in flink-conf.yaml. (This option was added in Flink 1.10 to recreate a scheduling behavior similar to what was the default before Flink 1.5.)
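
A minimal flink-conf.yaml sketch combining the two settings mentioned here (the slot count comes from the question, the spread-out option is available from Flink 1.10 onward):

taskmanager.numberOfTaskSlots: 10
cluster.evenly-spread-out-slots: true   # spread slots across TaskManagers instead of filling one first

After redeploying with this option, a job with parallelism 10 should end up with roughly 5 slots in use on each of the two TaskManagers.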

AWS Fargate vs Batch vs ECS for a once a day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several web services, does some computation on the data (for about an hour), and posts the computed data to another web service; then the container exits (with a return code of 0 if OK, 1 if something failed during the process). During the process, some logs are written to STDOUT or STDERR. The batch must be triggered once a day.
I was wondering which AWS service is best suited to schedule, execute, and monitor my batch process:
At the very beginning, I used an EC2 machine with a crontab: no high-availability function there, so I decided to switch to a more PaaS-like approach.
Then I used Elastic Beanstalk for Docker, with a non-functional web server (only there to reply to the health check) and a crontab inside the container to wake up my batch command once a day. With an autoscaling rule of min=1 max=1, I had HA (if the container or the VM crashes, it is restarted by AWS).
But now, to be more efficient, I decided to move to an ECS-based approach, where I do not need EC2 instances awake 23 hours a day for nothing. So I tried Fargate:
With Fargate, I defined my task (Fargate type, not EC2 type) and configured everything on it.
I created a cluster to run my task: I can run the task "by hand, one time", so I know all the settings are correct.
Now, going deeper into Fargate, I want my task executed once a day.
It seems to work fine when I use the Scheduled Task feature of ECS: the container starts on time, the process runs, then the container stops. But CloudWatch is missing some metrics: CPUReservation and CPUUtilization are not reported. Also, there is no way to know whether the batch exited with code 0 or 1 (every execution stops with status "STOPPED"), so I can't raise a CloudWatch alarm when the container execution fails.
I also tried the "Services" feature of Fargate, but it can't handle a batch process, because the container is restarted every time it stops. This is normal, since the container does not run a daemon, but there is no way to schedule a service, and I want my container to be active only when it needs to work (once a day, for at most 1 hour). On the other hand, the metrics missing above are correctly reported in CloudWatch.
So here is my question: what are the most suitable AWS managed services to trigger a container once a day, let it run to do its task, and provide reporting facilities to track execution (CPU usage, batch duration), including an alarm (SNS) when the task fails?
We had the same issue with identifying failed jobs. I suggest you take a look at AWS Batch, where logs for FAILED jobs are available in CloudWatch Logs; take a look here.
One more thing you should consider is the total cost of ownership of whatever solution you eventually choose. Fargate, in this regard, is quite expensive.
This may be too late for your project, but I thought it could still benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow that starts tasks on ECS/Fargate (or jobs on EKS, for that matter), waits for the results, and raises alarms/sends emails...
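
To make the scheduling side concrete, here is a minimal CloudFormation sketch of an EventBridge rule that launches a Fargate task once a day (the same mechanism ECS Scheduled Tasks uses); every ARN, subnet, and name below is a hypothetical placeholder:

Resources:
  DailyBatchRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: cron(0 6 * * ? *)   # every day at 06:00 UTC
      State: ENABLED
      Targets:
        - Id: daily-php-batch
          Arn: arn:aws:ecs:eu-west-1:123456789012:cluster/batch-cluster   # hypothetical cluster ARN
          RoleArn: arn:aws:iam::123456789012:role/ecsEventsRole           # hypothetical role allowed to run the task
          EcsParameters:
            TaskDefinitionArn: arn:aws:ecs:eu-west-1:123456789012:task-definition/php-batch:1
            TaskCount: 1
            LaunchType: FARGATE
            NetworkConfiguration:
              AwsVpcConfiguration:
                Subnets:
                  - subnet-0abc1234def567890   # hypothetical subnet
                AssignPublicIp: ENABLED

Exit-code-based alerting still needs something extra, for example an EventBridge rule matching ECS "Task State Change" events with a non-zero container exit code, or the AWS Batch / Step Functions approaches suggested above.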

How to run kubernetes pod for a set period of time each day?

I'm looking for a way to deploy a pod on Kubernetes that runs for a few hours each day. Essentially, I want it to start every morning at 8 AM and continue running until about 5:30 PM.
I've been researching a lot and haven't found a way to deploy a pod with a specific timeframe in mind. I've found CronJobs, but those seem to be for pods that terminate themselves, whereas mine should run constantly.
Is there any way to deploy my pod on Kubernetes this way? Or should I just set up the pod itself to run its intended application based on its internal clock?
According to the Kubernetes architecture, a Job creates one or more pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions, and when a specified number of successful completions is reached, the Job itself is complete.
In simple words, Jobs run until completion or failure. That's why there is no option to schedule a CronJob's termination in Kubernetes.
In your case, you can start a CronJob regularly and terminate it using one of the following options:
The better way is for the container to terminate itself, so you can add such functionality to your application or use cron. More information about how to add cron to a Docker container can be found here. A sketch of this approach follows below.
You can use another CronJob to terminate your CronJob: you need to run a command inside a pod that finds and deletes the pod related to your Job. For more information, you can look through this link. But this is not a good way, because your CronJob will always end with a failed status.
In both cases, you need to check with what status your CronJob finished and use the correct restartPolicy accordingly.
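
A minimal sketch of the first option, assuming the GNU coreutils timeout utility is available in the image; the image name and entrypoint are hypothetical, and 34200 seconds covers the 08:00-17:30 window:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: workday-pod
spec:
  schedule: "0 8 * * *"              # start every day at 08:00
  jobTemplate:
    spec:
      activeDeadlineSeconds: 34200   # safety net: kill the Job after 9.5 hours
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: app
              image: example/app:latest                     # hypothetical image
              command: ["timeout", "9.5h", "/app/run.sh"]   # hypothetical entrypoint; exits at ~17:30

Note that timeout exits with status 124 when it kills the command, so when you check the Job status, treat that as the normal end of the workday rather than a failure.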
It seems you can implement this using a CronJob object:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#creating-a-cron-job

Can Rundeck Be Configured to Use Different Nodes If Someone Else Is Already Running a Job on the Current Node?

I am trying to see if Rundeck is capable of deciding that a node is currently busy running another job, so that it will switch to another node and run the job there instead.
For example, I am currently running a job on NODE1; then another person logs into Rundeck and decides to run their job on NODE1, but NODE1 is busy running my job, so Rundeck should automatically run their job on NODE2.
Thanks
This could be possible with the following design:
1) Create or use a resource model source plugin that sets attributes describing each node's busyness. This can be a metric like load status, or something else you use to gauge Rundeck utilization.
2) Write the job with a node filter that matches on that utilization attribute.
3) Define the job to use the Random Subset orchestrator strategy, specifying that 1 node should be used.
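
A minimal sketch of such a job in Rundeck's YAML job definition format; the busy attribute is hypothetical (it must be published by your resource model source), and the orchestrator provider name can vary by Rundeck version, so verify it by exporting a job configured through the GUI:

- name: run-on-an-idle-node
  nodefilters:
    filter: 'busy: false'            # hypothetical attribute set by the model source
  orchestrator:
    type: subset                     # Random Subset orchestrator (assumed provider name)
    configuration:
      count: '1'                     # dispatch to a single matching node
  sequence:
    commands:
      - exec: /opt/scripts/do-work.sh   # hypothetical command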