I have a golang server which is used to deploy AI model trainings into kubernetes that every training would run their job on a pod. After job is completed, my server need to upload the model output to HDFS/S3.
So I need a post-container to process the upload task like init-container doing init task on k8s pod.
For now, i use a tricky way that adding the model job container into init-containers and running the upload task container in containers. This works if there's no error thrown in init-containers. However, if there's errors in init-conttainer the pod status is Init:ContainerCannotRun which should be Failed in normal.
I know i can attach a preStop command to container lifecycle events if the image container the upload-hdfs/s3 command tool. However I do not want to let the model training images to include these commands. So this is not my answer.
So my question is that how to implement a post process container so that i can run the upload task after job is completed?
I also find a related issue in github, i would try it if there's no other choice.
There are two ways in Kubernetes to cope with it - preStop hook and the code which will listen for the SIGTERM signal.
Also, you can increase the termination grace period to have more time before pod will be terminated.
More information you can find in Attach Handlers to Container Lifecycle Events and Kubernetes best practices: terminating with grace.
Related
Current Setup
We have kubernetes cluster setup with 3 kubernetes pods which run spring boot application. We run a job every 12 hrs using spring boot scheduler to get some data and cache it.(there is queue setup but I will not go on those details as my query is for the setup before we get to queue)
Problem
Because we have 3 pods and scheduler is at application level , we make 3 calls for data set and each pod gets the response and pod which processes at caches it first becomes the master and other 2 pods replicate the data from that instance.
I see this as a problem because we will increase number of jobs for get more datasets , so this will multiply the number of calls made.
I am not from Devops side and have limited azure knowledge hence I need some help from community
Need
What are the options available to improve this? I want to separate out Cron schedule to run only once and not for each pod
1 - Can I keep cronjob at cluster level , i have read about it here https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
Will this solve a problem?
2 - I googled and found other option is to run a Cronjob which will schedule a job to completion, will that help and not sure what it really means.
Thanks in Advance to taking out time to read it.
Based on my understanding of your problem, it looks like you have following two choices (at least) -
If you continue to have scheduling logic within your springboot main app, then you may want to explore something like shedlock that helps make sure your scheduled job through app code executes only once via an external lock provider like MySQL, Redis, etc. when the app code is running on multiple nodes (or kubernetes pods in your case).
If you can separate out the scheduler specific app code into its own executable process (i.e. that code can run in separate set of pods than your main application code pods), then you can levarage kubernetes cronjob to schedule kubernetes job that internally creates pods and runs your application logic. Benefit of this approach is that you can use native kubernetes cronjob parameters like concurrency and few others to ensure the job runs only once during scheduled time through single pod.
With approach (1), you get to couple your scheduler code with your main app and run them together in same pods.
With approach (2), you'd have to separate your code (that runs in scheduler) from overall application code, containerize it into its own image, and then configure kubernetes cronjob schedule with this new image referring official guide example and kubernetes cronjob best practices (authored by me but can find other examples).
Both approaches have their own merits and de-merits, so you can evaluate them to suit your needs best.
I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, do some computation on data (during ~1h), and post computed data to an other webservice, then the container exit (with a return code of 0 if OK, 1 if failure somewhere on the process). During the process, some logs are written on STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very begining, I used a EC2 machine with a crontab : no high-availibilty function here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional Webserver (only to reply to the Healthcheck), and a Crontab inside the container to wake-up my batch command once a day. With autoscalling rule min=1 max=1, I have HA (if the container crash or if the VM crash, it is restarted by AWS)
but now, to be more efficient, I decided to move to some ECS service, and have an approach where I do not need to have EC2 instances awake 23/24 for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type), and configure everything on it.
I create a Cluster to run my task : I can run "by hand, one time" my task, so I know every settings are corrects.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I used the Scheduled Task feature of ECS : the container start on time, the process run, then the container stop. But CloudWatch is missing some metrics : CPUReservation and CPUUtilization are not reported. Also, there is no way to know if the batch quit with exit code 0 or 1 (all execution stopped with status "STOPPED"). So i Cant send a CloudWatch alarm if the container execution failed.
I use the "Services" feature of Fargate, but it cant handle a batch process, because the container is started every time it stops. This is normal, because the container do not have any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day during at max 1h). But the missing metrics are correctly reported in CloudWatch.
Here are my questions : what are the best suitable AWS managed services to trigger a container once a day, let it run to do its task, and have reporting facility to track execution (CPU usage, batch duration), including alarm (SNS) when task failed ?
We had the same issue with identifying failed jobs. I propose you take a look into AWS Batch where logs for FAILED jobs are available in CloudWatch Logs; Take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.
may be too late for your projects but still I thought it could benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...
I'm looking for a way to deploy a pod on kubernetes to run for a few hours each day. Essentially I want it to run every morning at 8AM and continue running until about 5:30 PM.
I've been researching a lot and haven't found a way to deploy the pod with a specific timeframe in mind. I've found cron jobs, but that seems to be to be for pods that terminate themselves, whereas mine should be running constantly.
Is there any way to deploy my pod on kubernetes this way? Or should I just set up the pod itself to run its intended application based on its internal clock?
According to the Kubernetes architecture, a Job creates one or more pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the job tracks the successful completions. When a specified number of successful completions is reached, the job itself is complete.
In simple words, Jobs run until completion or failure. That's why there is no option to schedule a Cron Job termination in Kubernetes.
In your case, you can start a Cron Job regularly and terminate it using one of the following options:
A better way is to terminate a container by itself, so you can add such functionality to your application or use Cron. More information about how to add Cron to the Docker container, you can find here.
You can use another Cron Job to terminate your Cron Job. You need to run a command inside a Pod to find and delete a Pod related to your Job. For more information, you can look through this link. But it is not a good way, because your Cron Job will always have failed status.
In both cases, you need to check with what status your Cron Job was finished and use the correct RestartPolicy accordingly.
It seems you can implement using a cronjob object,
[ https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#creating-a-cron-job ]
I am creating the deployments/services using REST APIs. I send POST request with bodies which contain the JSON objects which create the applications on Openshift. After I call all the APIs, these objects get instantiated.
I have 2 deployments which are dependent on mongodb deployment but this mongodb takes a little longer to start running, while the two deployments which are dependent on mongodb start running earlier. This breaks the code inside the 2 deployments as the mongodb connection fails(since it is not up yet).
There could be 2 possible way I can fix this problem.
I put a delay after i create mongodb deployment and recursively call the API to check it's status if it is running or not.
Just like we make changes in docker-compose, with the key, depends-on which tell the docker-compose that all the dependencies should be started first and then the dependent container.
Is there any way this could be achieved in openshift?
Instead of implementing complex logic for dependency handling, use health checking mechanism of Kubernetes. If your application starts and doesn't see Mongo DB, let it crash. Kubernetes will keep restarting it until Mongo DB comes online, and your application becomes healthy and serving as well. Kubernetes won't send traffic to not yet healthy instances.
Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
Just like we make changes in docker-compose, with the key, depends-on which tell the docker-compose that all the dependencies should be started first and then the dependent container.
You might want to look into Init Containers for dependent container. They run to completion before container is actually started. Below excerpt is taken from referenced documentation (given below) for use cases that might be applicable to your issue:
They run to completion before any app Containers start, whereas app Containers run in parallel, so Init Containers provide an easy way to block or delay the startup of app Containers until some set of preconditions are met.
Examples
Here are some ideas for how to use Init Containers:
Wait for a service to be created with a shell command like:
for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1
Register this Pod with a remote server from the downward API with a command like:
curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d ‘instance=$()&ip=$()’
Wait for some time before starting the app Container with a command like sleep 60.
Reference documentation:
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
Alex has pointed out correct practice to follow with kubernetes. But if you still want directly depend on other pod phase you can use this pod-dependency-init-container that I have build. This will check if any pod with given labels is running before starting your pod.
I have a Dropwizard application that holds short-lived sessions (phone calls) in memory. There are good reasons for this, and we are not changing this model in the near term.
We will be setting up Kubernetes soon and I'm wondering what the best approach would be for handling shutdowns / rolling updates. The process will need to look like this:
Remove the DW node from the load balancer so no new sessions can be started on this node.
Wait for all remaining sessions to be completed. This should take a few minutes.
Terminate the process / container.
It looks like kubernetes can handle this if I make step 2 a "preStop hook"
http://kubernetes.io/docs/user-guide/pods/#termination-of-pods
My question is, what will the preStop hook actually look like? Should I set up a DW "task" (http://www.dropwizard.io/0.9.2/docs/manual/core.html#tasks) that will just wait until all sessions are completed and CURL it from kubernetes? Should I put a bash script that polls some sessionCount service until none are left in the docker container with the DW app and execute that?
Assume you don't use the preStop hook, and a pod deletion request has been issued.
API server processes the deletion request and modify the pod object.
Endpoint controller observes the change and removes the pod from the list of endpoints.
On the node, a SIGTERM signal is sent to your container/process.
Your process should trap the signal and drains all existing requests. Note that this step should not take more than the defined TerminationGracePeriod defined in your pod spec.
Alternatively, you can use the preStop hook which blocks until all the requests are drained. Most likely, you'll need a script to accomplish this.