I have several instances of an "orchestrator" microservice that run on different nodes and execute Spring Batch jobs. Only one instance has to be "active" and conduct a job at a time. The jobs are scheduled twice a day via the @Scheduled annotation with a cron expression.
So, the microservice tries to execute jobs with a single identifying JobParameter, which is LocalDateTime.now() truncated to seconds to compensate for the time difference between the OpenShift nodes my instances run on.
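Roughly, the launching code looks like this (a sketch only; the class name, parameter name, and cron expression are illustrative, not the real production code):

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class OrchestratorScheduler {

    private final JobLauncher jobLauncher;
    private final Job job;

    public OrchestratorScheduler(JobLauncher jobLauncher, Job job) {
        this.jobLauncher = jobLauncher;
        this.job = job;
    }

    // Twice a day; the actual cron expression is an assumption.
    @Scheduled(cron = "0 0 6,18 * * *")
    public void launch() throws Exception {
        // Single identifying parameter, truncated to seconds so that all
        // instances build the same JobParameters for the same scheduled run.
        JobParameters params = new JobParametersBuilder()
                .addString("runTime",
                        LocalDateTime.now().truncatedTo(ChronoUnit.SECONDS).toString())
                .toJobParameters();
        jobLauncher.run(job, params);
    }
}
```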
The underlying DB is Postgres 12, whose transaction isolation level is set to repeatable read.
The problem seems impossible to me, but it happens and always reproduces. Job execution fails on each microservice instance with a DuplicateKeyException on the composite PK, which is (not surprisingly) the job name and the identifying parameter's hash.
The question is: how is this possible, and what am I missing? Any ideas?
Sorry for such a late answer. There was no problem at all; the locks work correctly regardless of the transaction isolation level. We have two OpenShift clusters - active and inactive. The jobs were running on the "inactive" nodes, which are called that only because no client traffic is routed to them. As it turned out, production support had no access to the "inactive" nodes' logs :)
I am trying to find a solution to run a cron job in a Kubernetes-deployed app without unwanted duplicates. Let me describe my scenario, to give you a little bit of context.
I want to schedule jobs that execute once at a specified date. More precisely: creating such a job can happen anytime and its execution date will be known only at that time. The job that needs to be done is always the same, but it needs parametrization.
My application is running inside a Kubernetes cluster, and I cannot assume that there will always be only one instance of it running at any given moment in time. Therefore, creating the said job will lead to multiple executions of it, because all of my application instances will spawn it. However, I want to guarantee that a job runs exactly once in the whole cluster.
I tried to find solutions for this problem and came up with the following ideas.
Create a local file and check if it is already there when starting a new job. If it is there, cancel the job.
Not possible in my case, since the duplicate jobs might run on other machines!
Utilize the Kubernetes CronJob API.
I cannot use this feature because I have to create cron jobs dynamically from inside my application, and I cannot change the cluster configuration from a pod running inside that cluster. Maybe there is a way, but it seems to me there has to be a better solution than giving the application access to the cluster it is running in.
Would you please be so kind as to give me any directions in which I might find a solution?
I am using a managed Kubernetes Cluster on Digital Ocean (Client Version: v1.22.4, Server Version: v1.21.5).
After thinking about a solution for a rather long time I found it.
The solution is to take the scheduling of the jobs to a central place. It is as easy as building a job web service that exposes endpoints to create jobs. An instance of a backend creating a job at this service will also provide a callback endpoint in the request which the job web service will call at the execution date and time.
The endpoint in my case links back to the calling backend server which carries the logic to be executed. It would be rather tedious to make the job service execute the logic directly since there are a lot of dependencies involved in the job. I keep a separate database in my job service just to store information about whom to call and how. Addressing the startup after crash problem becomes trivial since there is only one instance of the job web service and it can just re-create the jobs normally after retrieving them from the database in case the service crashed.
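To illustrate the idea, the core of such a job service is small. This is a hypothetical sketch, not the actual service: the endpoint path, request fields, and the omitted persistence step are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

import org.springframework.scheduling.TaskScheduler;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class JobController {

    record CreateJobRequest(Instant executeAt, String callbackUrl) {}

    private final TaskScheduler scheduler;
    private final HttpClient httpClient = HttpClient.newHttpClient();

    public JobController(TaskScheduler scheduler) {
        this.scheduler = scheduler;
    }

    @PostMapping("/jobs")
    public void createJob(@RequestBody CreateJobRequest request) {
        // In the real service the job is also persisted here, so it can be
        // re-created from the database after a crash of the job service.
        scheduler.schedule(() -> fireCallback(request.callbackUrl()), request.executeAt());
    }

    private void fireCallback(String callbackUrl) {
        HttpRequest callback = HttpRequest.newBuilder(URI.create(callbackUrl))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        try {
            httpClient.send(callback, HttpResponse.BodyHandlers.discarding());
        } catch (Exception e) {
            // A failed callback must be retried or flagged by the
            // reconciliation mechanism mentioned below.
            throw new IllegalStateException("Callback failed: " + callbackUrl, e);
        }
    }
}
```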
Do not forget to take care of failing jobs. If your backends are not reachable for some reason when the callback fires, there must be some reconciliation mechanism in place that prevents this failure from staying unnoticed.
A little note I want to add: in case you also want to scale the job service horizontally, you run into very similar problems again. However, if you think about what the actual work to be done in that service is, you realize that it is very lightweight. I am not sure horizontal scaling is ever a requirement, since the service is only making requests at specified times and is not executing heavy work.
Are there any existing out-of-the-box job queue frameworks? The basic idea is:
1. Someone enqueues a job with status New.
2. (Multiple) workers get a job and work on it, marking the job as Taken. One job can be running on at most one worker.
3. Something monitors worker status; if a running job exceeds a predefined timeout, it is re-queued with status New (it could be a worker health issue).
4. Once a worker completes a task, it marks the task as Completed in the queue.
5. Something keeps cleaning up completed tasks. Alternatively, at step 4, when a worker completes a task it simply dequeues it.
From my investigation, things like Kafka (pub/sub), MQ (push/pull & pub/sub), or caches (Redis, Memcached) are mostly sufficient for this work. However, they all require some development around their core functionality to become a fully functional job queue.
I also looked into relational DBs; the ones that support the SELECT ... FOR UPDATE SKIP LOCKED syntax are also good candidates, but this again requires a daemon between the DB and the workers, which means extra effort.
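For completeness, a worker built on that approach is roughly this much code (a plain-JDBC sketch; the jobs table layout and status values are just the ones from the list above):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class JobWorker {

    private final DataSource dataSource;

    public JobWorker(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Claims one New job, processes it, and marks it Completed. Returns false if the queue is empty. */
    public boolean pollOnce() throws Exception {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);
            try (PreparedStatement claim = con.prepareStatement(
                    "SELECT id, payload FROM jobs WHERE status = 'New' " +
                    "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED");
                 ResultSet rs = claim.executeQuery()) {

                if (!rs.next()) {
                    con.rollback();
                    return false; // nothing to claim
                }
                long id = rs.getLong("id");

                // The row stays locked while we work, so no other worker can claim it.
                // If this JVM dies here, the lock is released and the job stays New.
                process(rs.getString("payload"));

                try (PreparedStatement done = con.prepareStatement(
                        "UPDATE jobs SET status = 'Completed' WHERE id = ?")) {
                    done.setLong(1, id);
                    done.executeUpdate();
                }
                con.commit(); // releases the row lock
                return true;
            }
        }
    }

    private void process(String payload) {
        // actual work goes here
    }
}
```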
I also looked into cloud solutions (Azure Queue Storage, etc.), with a similar assessment.
So my question is: is there any out-of-the-box solution for job queuing that is tailored and dedicated to that one thing and doesn't require much effort to set up?
Thanks
Take a look at Python Celery. https://docs.celeryproject.org/en/stable/getting-started/introduction.html
The default mode uses RabbitMQ as the message broker, but other options are available. Results can be stored in a DB if needed.
Mongock looks very promising. We want to use it inside a kubernetes service that has multiple replicas that run in parallel.
We are hoping that when our service is deployed, the first replica will acquire the mongockLock and all of its ChangeLogs/ChangeSets will be completed before the other replicas attempt to run them.
We have a single instance of mongodb running in our kubernetes environment, and we want the mongock ChangeLogs/ChangeSets to execute only once.
Will the mongockLock guarantee that only one replica will run the ChangeLogs/ChangeSets to completion?
Or do I need to enable transactions (or some other configuration)?
I am going to provide the short answer first and then the long one. I suggest you read the long one too in order to understand it properly.
Short answer
By default, Mongock guarantees that the ChangeLogs/ChangeSets will be run by only one pod at a time: the one owning the lock.
Long answer
What really happens behind the scenes (if not configured otherwise) is that when a pod takes the lock, the others will try to acquire it too, but they can't, so they are forced to wait for a while (configurable, but 4 minutes by default) as many times as the lock is configured (3 times by default). After this, if a pod is still not able to acquire the lock and there are still pending changes to apply, Mongock will throw a MongockException, which should mean the JVM startup fails (which is what happens by default in Spring).
This is fine in Kubernetes, because it ensures it will restart the pods.
So now, assuming the pods start again and changeLogs/changeSets are already applied, the pods start successfully because they don't even need to acquire the lock as there aren't pending changes to apply.
Potential problem with MongoDB without transaction support and Frameworks like Spring
Now, assuming the lock and the mutual exclusion are clear, I'd like to point out a potential issue that needs to be mitigated by the changeLog/changeSet design.
This issue applies if you are in an environment such as Kubernetes, which has a pod initialisation time, your migration takes longer than that initialisation time, and the Mongock process is executed before the pod becomes ready/healthy (and completing it is a condition for readiness). This last condition is highly desirable, as it ensures the application runs with the right version of the data.
In this situation, imagine the pod starts the Mongock process. After the Kubernetes initialisation time, the process is still not finished, and Kubernetes stops the JVM abruptly. This means that some changeSets were successfully executed, some others not even started (no problem, they will be processed in the next attempt), but one changeSet was partially executed and marked as not done. This is the potential issue. The next time Mongock runs, it will see the changeSet as pending and execute it from the beginning. If you haven't designed your changeLogs/changeSets accordingly, you may experience some unexpected results, because part of the data process covered by that changeSet has already taken place and will happen again.
This somehow needs to be mitigated: either with the help of mechanisms like transactions, with a changeLog/changeSet design that takes this into account, or both.
Mongock currently provides transactions in an "all or nothing" fashion, but that doesn't really help much, as it will retry everything from scratch every time and will probably end up in an infinite loop. The next version, 5, will provide transactions per changeLog and changeSet, which, together with good organisation, is the right solution for this.
Meanwhile, this issue can be addressed by following these design suggestions.
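To make the partial-execution scenario concrete, an idempotent changeSet could look something like this (a sketch only; the annotation package and the injected template type depend on your Mongock version and driver, and the collection and field names are made up):

```java
import com.github.cloudyrock.mongock.ChangeLog;
import com.github.cloudyrock.mongock.ChangeSet;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

@ChangeLog(order = "001")
public class CustomerMigrationChangeLog {

    @ChangeSet(order = "001", id = "add-status-field", author = "backend-team")
    public void addStatusField(MongoTemplate mongoTemplate) {
        // Only touch documents that are still missing the field, so re-running
        // the changeSet after an abrupt JVM shutdown repeats no work.
        Query missingStatus = new Query(Criteria.where("status").exists(false));
        Update setDefault = new Update().set("status", "ACTIVE");
        mongoTemplate.updateMulti(missingStatus, setDefault, "customers");
    }
}
```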
Just to follow up... Mongock's locking mechanism works fine with replicas. To solve the "long-running script" problem, we will run our Mongock scripts from a Kubernetes initContainer. K8s will wait for the initContainers to finish before it starts the pod's main service containers.
For transactions, we will follow the advice above of making our scripts idempotent.
I am currently running a Flink session cluster (Kubernetes, 1 JobManager, 1 TaskManager, Zookeeper, S3) in which multiple jobs run.
As we are working on adding more jobs, we are looking to improve our deployment and cluster management strategies. We are considering migrating to job clusters; however, there are reservations about the number of containers which would be spawned. One container per job is not an issue, but two containers (1 JM and 1 TM) per job raises concerns about memory consumption. Several of the jobs need high availability and the ability to use checkpoints and restore from/take savepoints, as they aggregate events over a window.
From my reading of the documentation and spending time on Google, I haven't found anything that seems to state whether or not what is being considered is really possible.
Is it possible to do any of these three things:
run both the JobManager and TaskManager as separate processes in the same container and have that serve as the Flink cluster, or
run the JobManager and TaskManager as literally the same process, or
run the job as a standalone JAR with the ability to recover from/take checkpoints and the ability to take a savepoint and restore from that savepoint?
(If anyone has any better ideas, I'm all ears.)
One of the responsibilities of the JobManager is to monitor the TaskManager(s) and initiate restarts when failures have occurred. That works nicely in containerized environments when the JM and TMs are in separate containers; otherwise it seems like you're asking for trouble. Keeping the TMs separate also makes sense if you are ever going to scale up, though that may be moot in your case.
What might be workable, though, would be to run the job using a LocalExecutionEnvironment (so that everything is in one process -- this is sometimes called a Flink minicluster). This path strikes me as feasible, if you're willing to work at it, but I can't recommend it. You'll have to somehow keep track of the checkpoints, and arrange for the container to be restarted from a checkpoint when things fail. And there are other things that may not work very well -- see this question for details. The LocalExecutionEnvironment wasn't designed with production deployments in mind.
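If you do go down that path, the minicluster variant is roughly this (a sketch only; whether and how the restore path in the configuration is honored varies between Flink versions, and tracking that path across restarts is left to your own tooling):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MiniClusterJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical restore path your tooling would pass in after a container restart.
        if (args.length > 0) {
            conf.setString("execution.savepoint.path", args[0]);
        }

        // JobManager and TaskManager run inside this single JVM.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        env.enableCheckpointing(60_000); // checkpoint every minute

        // Placeholder pipeline; the real job logic goes here.
        env.fromElements(1, 2, 3)
           .map(i -> i * 2)
           .print();

        env.execute("minicluster-job");
    }
}
```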
What I'd suggest you explore instead is to see how far you can go toward making the standard, separate container solution affordable. For starters, you should be able to run the JM with minimal resources, since it doesn't have much to do.
Check out this operator, which automates the lifecycle of deploying and managing Flink in Kubernetes. The project is in beta, but you can still get some ideas about how to do it, or directly use this operator if it fits your requirements. Here the JobManager and TaskManager are separate Kubernetes deployments.
When we run multiple concurrent jobs with different parameters, how can we control (stop, restart) the appropriate jobs? Our internal code provides the JobExecution object, but under the covers the JobOperator uses the job name to get the job instance.
In our case all of the jobs are from "do-stuff.xml" (okay, it's sanitized and not very original). After looking at the spring-batch source code, our concern is that if there is more than one job running and we stop a job, it will take the most recently submitted job and stop that one.
The JobOperator will allow you to fetch all running executions of the job using getRunningExecutions(String jobName). You should be able to iterate over that list to find the one you want. Then, just call stop(long executionId) on the one you want.
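For example (the job name and the parameter marker used to tell executions apart are illustrative):

```java
import java.util.Set;

import org.springframework.batch.core.launch.JobOperator;

public class JobStopper {

    private final JobOperator jobOperator;

    public JobStopper(JobOperator jobOperator) {
        this.jobOperator = jobOperator;
    }

    /** Stops the running "do-stuff" execution whose parameters contain the given marker. */
    public void stop(String parameterMarker) throws Exception {
        Set<Long> runningIds = jobOperator.getRunningExecutions("do-stuff");
        for (Long executionId : runningIds) {
            // getParameters returns the job parameters as a readable String,
            // so a simple contains() is enough to pick out the right execution.
            if (jobOperator.getParameters(executionId).contains(parameterMarker)) {
                jobOperator.stop(executionId);
                return;
            }
        }
    }
}
```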
Alternatively, we've also implemented listeners (both at step and chunk level) to check an outage status table. When we want to implement a system-wide outage, we add the outage there and have our listener throw an exception to bring our jobs down. Once the outage is lifted, all "failed" executions may be restarted.
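A rough sketch of such a listener (the outage table name and query are assumptions, not our exact code; how the JdbcTemplate is wired in is up to your configuration):

```java
import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.jdbc.core.JdbcTemplate;

public class OutageChunkListener implements ChunkListener {

    private final JdbcTemplate jdbcTemplate;

    public OutageChunkListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeChunk(ChunkContext context) {
        Integer outages = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM outage_status WHERE active = true", Integer.class);
        if (outages != null && outages > 0) {
            // Failing the chunk fails the step and brings the job down;
            // once the outage row is cleared, the FAILED execution can be restarted.
            throw new IllegalStateException("System-wide outage in effect");
        }
    }

    @Override
    public void afterChunk(ChunkContext context) { }

    @Override
    public void afterChunkError(ChunkContext context) { }
}
```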