AWS Fargate vs Batch vs ECS for a once a day batch process - amazon-ecs

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, do some computation on data (during ~1h), and post computed data to an other webservice, then the container exit (with a return code of 0 if OK, 1 if failure somewhere on the process). During the process, some logs are written on STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very begining, I used a EC2 machine with a crontab : no high-availibilty function here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional Webserver (only to reply to the Healthcheck), and a Crontab inside the container to wake-up my batch command once a day. With autoscalling rule min=1 max=1, I have HA (if the container crash or if the VM crash, it is restarted by AWS)
but now, to be more efficient, I decided to move to some ECS service, and have an approach where I do not need to have EC2 instances awake 23/24 for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type), and configure everything on it.
I create a Cluster to run my task : I can run "by hand, one time" my task, so I know every settings are corrects.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I used the Scheduled Task feature of ECS : the container start on time, the process run, then the container stop. But CloudWatch is missing some metrics : CPUReservation and CPUUtilization are not reported. Also, there is no way to know if the batch quit with exit code 0 or 1 (all execution stopped with status "STOPPED"). So i Cant send a CloudWatch alarm if the container execution failed.
I use the "Services" feature of Fargate, but it cant handle a batch process, because the container is started every time it stops. This is normal, because the container do not have any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day during at max 1h). But the missing metrics are correctly reported in CloudWatch.
Here are my questions : what are the best suitable AWS managed services to trigger a container once a day, let it run to do its task, and have reporting facility to track execution (CPU usage, batch duration), including alarm (SNS) when task failed ?

We had the same issue with identifying failed jobs. I propose you take a look into AWS Batch where logs for FAILED jobs are available in CloudWatch Logs; Take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.

may be too late for your projects but still I thought it could benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...

Related

NestJS schedualers are not working in production

I have a BE service in NestJS that is deployed in Vercel.
I need several schedulers, so I have used #nestjs/schedule lib, which is super easy to use.
Locally, everything works perfectly.
For some reason, the only thing that is not working in my production environment is those schedulers. Everything else is working - endpoints, data base access..
Does anyone has an idea why? is it something with my deployment? maybe Vercel has some issue with that? maybe this schedule library requires something the Vercel doesn't have?
I am clueless..
Cold boot is the process of starting a computer from shutdown or a powerless state and setting it to normal working condition.
Which means that the code you deployed in a serveless manner, will run when the endpoint is called. The platform you are using spins up a virtual machine, to execute your code. And keeps the machine running for a certain period of time, incase you get another API hit, it's cheaper and easier on them to keep the machine running for lets say 5 minutes or 60 seconds, than to redeploy it on every call after shutting the machine when function execution ends.
So in your case, most likely what is happening is that the machine that you are setting the cron on, is killed after a period of time. Crons are system specific tasks which run in the kernel. But if the machine is shutdown, the cron dies with it. The only case where the cron would run, is if the cron was triggered at a point of time, before the machine was shut down.
Certain cloud providers give you the option to keep the machines alive. I remember google cloud used to follow the path of that if a serveless function is called frequently, it shifts from cold boot to hot start, which doesn't kill the machine entirely, and if you have traffic the machines stay alive.
From quick research, vercel isn't the best to handle crons, due to the nature of the infrastructure, and this is what you are looking for. In general, crons aren't for serveless functions. You can deploy the crons using queues for example or another third party service, check out this link by vercel.

How to create a cron job in a Kubernetes deployed app without duplicates?

I am trying to find a solution to run a cron job in a Kubernetes-deployed app without unwanted duplicates. Let me describe my scenario, to give you a little bit of context.
I want to schedule jobs that execute once at a specified date. More precisely: creating such a job can happen anytime and its execution date will be known only at that time. The job that needs to be done is always the same, but it needs parametrization.
My application is running inside a Kubernetes cluster, and I cannot assume that there always will be only one instance of it running at the any moment in time. Therefore, creating the said job will lead to multiple executions of it due to the fact that all of my application instances will spawn it. However, I want to guarantee that a job runs exactly once in the whole cluster.
I tried to find solutions for this problem and came up with the following ideas.
Create a local file and check if it is already there when starting a new job. If it is there, cancel the job.
Not possible in my case, since the duplicate jobs might run on other machines!
Utilize the Kubernetes CronJob API.
I cannot use this feature because I have to create cron jobs dynamically from inside my application. I cannot change the cluster configuration from a pod running inside that cluster. Maybe there is a way, but it seems to me there have to be a better solution than giving the application access to the cluster it is running in.
Would you please be as kind as to give me any directions at which I might find a solution?
I am using a managed Kubernetes Cluster on Digital Ocean (Client Version: v1.22.4, Server Version: v1.21.5).
After thinking about a solution for a rather long time I found it.
The solution is to take the scheduling of the jobs to a central place. It is as easy as building a job web service that exposes endpoints to create jobs. An instance of a backend creating a job at this service will also provide a callback endpoint in the request which the job web service will call at the execution date and time.
The endpoint in my case links back to the calling backend server which carries the logic to be executed. It would be rather tedious to make the job service execute the logic directly since there are a lot of dependencies involved in the job. I keep a separate database in my job service just to store information about whom to call and how. Addressing the startup after crash problem becomes trivial since there is only one instance of the job web service and it can just re-create the jobs normally after retrieving them from the database in case the service crashed.
Do not forget to take care of failing jobs. If your backends are not reachable for some reason to take the callback, there must be some reconciliation mechanism in place that will prevent this failure from staying unnoticed.
A little note I want to add: In case you also want to scale the job service horizontally you run into very similar problems again. However, if you think about what is the actual work to be done in that service, you realize that it is very lightweight. I am not sure if horizontal scaling is ever a requirement, since it is only doing requests at specified times and is not executing heavy work.

Batch account node restarted unexpectedly

I am using an Azure batch account to run sqlpackage.exe in order to move databases from a server to another. A task that has started 6 days ago has suddenly been restarted and started from the beginning after 4 days of running (extremely large databases). The task run uninterruptedly up until then and should have continued to run for about 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during the execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostics settings configured and I could only look at the metrics and there was a short period when there wasn't any node.
What can be the causes for this behavior? Restarting while the node is still running
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check system logs to give you details on why the node restarted. There is no guarantee that a compute node will have 100% uptime as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.

Spring boot scheduler running cron job for each pod

Current Setup
We have kubernetes cluster setup with 3 kubernetes pods which run spring boot application. We run a job every 12 hrs using spring boot scheduler to get some data and cache it.(there is queue setup but I will not go on those details as my query is for the setup before we get to queue)
Problem
Because we have 3 pods and scheduler is at application level , we make 3 calls for data set and each pod gets the response and pod which processes at caches it first becomes the master and other 2 pods replicate the data from that instance.
I see this as a problem because we will increase number of jobs for get more datasets , so this will multiply the number of calls made.
I am not from Devops side and have limited azure knowledge hence I need some help from community
Need
What are the options available to improve this? I want to separate out Cron schedule to run only once and not for each pod
1 - Can I keep cronjob at cluster level , i have read about it here https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
Will this solve a problem?
2 - I googled and found other option is to run a Cronjob which will schedule a job to completion, will that help and not sure what it really means.
Thanks in Advance to taking out time to read it.
Based on my understanding of your problem, it looks like you have following two choices (at least) -
If you continue to have scheduling logic within your springboot main app, then you may want to explore something like shedlock that helps make sure your scheduled job through app code executes only once via an external lock provider like MySQL, Redis, etc. when the app code is running on multiple nodes (or kubernetes pods in your case).
If you can separate out the scheduler specific app code into its own executable process (i.e. that code can run in separate set of pods than your main application code pods), then you can levarage kubernetes cronjob to schedule kubernetes job that internally creates pods and runs your application logic. Benefit of this approach is that you can use native kubernetes cronjob parameters like concurrency and few others to ensure the job runs only once during scheduled time through single pod.
With approach (1), you get to couple your scheduler code with your main app and run them together in same pods.
With approach (2), you'd have to separate your code (that runs in scheduler) from overall application code, containerize it into its own image, and then configure kubernetes cronjob schedule with this new image referring official guide example and kubernetes cronjob best practices (authored by me but can find other examples).
Both approaches have their own merits and de-merits, so you can evaluate them to suit your needs best.

Jenkins trigger job by another which are running on offline node

Is there any way to do the following:
I have 2 jobs. One job on offline node has to trigger the second one. Are there any plugins in Jenkins that can do this. I know that TeamCity has a way of achieving this, but I think that Jenkins is more constrictive
When you configure your node, you can set Availability to Take this slave on-line when in demand and off-line when idle.
Set Usage as Leave this machine for tied jobs only
Finally, configure the job to be executed only on that node.
This way, when the job goes to queue and cannot execute (because the node is offline), Jenkins will try to bring this node online. After the job is finished, the node will go back to offline.
This of course relies on the fact that Jenkins is configured to be able to start this node.
One instance will always be turn on, on which the main job can be run. And have created the job which will look in DB and if in the DB no running instances, it will prepare one node. And the third job after running tests will clean up my environment.