Quartz job does not fire, often - quartz-scheduler

I'm using quartz 2.2.1 and Spring 3.2.5, both in a Maven / war project.
My war-file deploys fine under apache-tomcat-7.x, and the logs
indicate that all of the quartz jobs are loaded. Here's where
the trouble begins.
Several jobs fire as their triggers dictate. But consistently, in many instances, a trigger does not fire its job when it should, while at other times it fires the job as expected. Why?
It's as if Quartz has a bug interpreting triggers, especially when triggers result in multiple jobs at a single time slice (e.g., on each 10th minute of the hour, three different jobs should fire).
Can anybody explain what's going on? To my thinking, there should
not be any missed triggers at all.
Thanks.

For this particular issue, changing the misfireThreshold to a bigger value solves it. We had 10 milliseconds configured; changing the value to 10000 (10 seconds) resolved the problem.
<add key="quartz.jobStore.misfireThreshold" value="10000" />
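(The snippet above is Quartz.NET-style configuration. For the Java/Spring setup described in the question, the same setting would normally go into quartz.properties; the property name below is the standard Quartz one, with the same 10-second value.)

org.quartz.jobStore.misfireThreshold = 10000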

Related

What queuing tools or algorithms support fair queuing of jobs

I am hitting a well-known problem, but I can't find a simple answer that tells me how to solve it.
I would appreciate some direction: which feature should I look for in available queuing software, or which algorithms are suitable if the solution requires programming in addition to the tools? Pointers to Python-supported tools would be especially helpful.
My problem is that over the span of a day I get jobs which deploy 10, 100, or 1000 tests (I exaggerate, but it helps make the point). Many jobs deploy 10 tests, some deploy 100 tests, and one or two deploy 1000 tests.
I want to deploy the tests in such a way that the delay in execution is spread fairly between all jobs. Let me explain.
If the very large job takes 2 hours on an idle server, it would be acceptable if it completes after 4 hours.
If a small job takes 3 minutes on an idle server, it would be acceptable if it completes after 15 minutes.
I want the delay in running the jobs to be spread fairly, so that jobs which started earlier don't get delayed too much. If it looks like a job is going to be delayed more than allowed, its priority should increase.
I think that priority queues may be the solution, where dynamically changing the weight of a large queue makes it faster when needed.
Is there queuing software that knows how to do the above automatically? Say I give each job some time limit; can the queuing software prioritize the tests from each queue so that no job is delayed too much?
Thanks.
Adding information following Jim's comments.
Not enough information to supply an answer. Is a job essentially just a list of tests? Can multiple tests for a single job be run concurrently? Do you always run all tests for a job? – Jim Mischel 14 hours ago
Each job deploys between 10 to 1000 tests.
The tests can run concurrently with all other tests, from the same or other users, without conflicts.
All tests that were deployed by a job are planned to run.
Additional info:
I've learned so far that prioritized queues are actually about applying weights to items in a single queue, where the items with the highest priority are pulled first. If two or more items have the same highest priority, the first item to arrive is executed first.
When I thought about priority queues, it was more along the lines of:
Multiple Queues, where each queue has a priority assigned to the entire queue.
The priority can be changed dynamically in runtime, based on some condition, e.g. setting a time limit on the execution of the entire queue.
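To make the "aging" idea concrete, here is a minimal Python sketch (all names are made up for illustration; real queuing software would add persistence, concurrency and worker management on top of something like this):

import time

class FairScheduler:
    # Toy dispatcher: each job gets an allowed delay, and the next test is
    # always taken from the job that is closest to missing its deadline.

    def __init__(self):
        self.jobs = []

    def add_job(self, job_id, tests, allowed_delay_seconds):
        # A larger job can simply be given a proportionally larger allowed delay.
        self.jobs.append({
            "id": job_id,
            "deadline": time.time() + allowed_delay_seconds,
            "tests": list(tests),
        })

    def next_test(self):
        pending = [j for j in self.jobs if j["tests"]]
        if not pending:
            return None
        # The effective priority is the remaining slack before the deadline,
        # so it increases automatically as a job gets closer to being late.
        job = min(pending, key=lambda j: j["deadline"] - time.time())
        return job["id"], job["tests"].pop(0)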

Do not reschedule Quartz stateful job

I have a Quartz v1.x stateful job. The repeat interval is, let's say, 1 minute. The job itself typically terminates within a second, but it might happen that it runs long, let's say 5 minutes. The scheduler prevents parallel runs, but when the long-running execution finishes, it starts the runs that were missed during it, one after another. In this example, 5 additional runs will be scheduled right after the long execution finishes. What I want is to make the scheduler "forget" the missed starts. E.g., if a run starts at 12:00 and finishes at 12:05, simply omit the runs at 12:01, 12:02, 12:03, 12:04, and depending on the exact finish time, even 12:05. Is this somehow possible?
I need a stateful job to prevent parallel execution. A stateless job with the proper annotation is not an option, because we are using Quartz version 1.x. I have already tried playing around with the misfire policies (e.g. MISFIRE_INSTRUCTION_DO_NOTHING), but it seems that these are not intended for such situations. Could anyone help me?

How does Quartz store job details between restarts (regarding last successful run, misfires etc.)?

A task is scheduled in the Quartz-scheduler for every hour, starting at 9 AM.
The application stops at 10 AM and is restarted at 12 PM. In that case two executions, at 10 AM and 11 AM, will be missed.
When the scheduler starts again, how many misfires will be considered?
As the job was executed at 9 AM, it should consider two misfires, from 10 AM and 11 AM. If so, how can Quartz identify the last successful execution, given that the application has been restarted?
You are asking two questions:
How many misfires will be considered?
As you guessed, two misfires (10 AM and 11 AM) will be detected.
However, Quartz may or may not consider all of them: depending on the misfire instruction configured in each trigger, Quartz may decide to just consider the last misfire, or ignore them all.
How does Quartz store job details between restarts?
Depending on your configuration, Quartz will store job data in a database via its JDBCJobStore (Java) / AdoJobStore (.NET), or in RAM via its RAMJobStore.
When using a database store, Quartz will persist all job details to it as they are scheduled, run, finished etc.; and retrieve said details from it upon restart.
When using a RAM store, job information will not be persisted between Scheduler runs. If the Scheduler gets stopped and then restarted, all jobs will need to be scheduled again; also, no misfires will be detected.
All of this is explained in Quartz's official documentation; I suggest you take a look around over there for more detailed info.
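For reference, a typical JDBCJobStore setup on the Java side looks roughly like the following quartz.properties fragment (the data source name myDS is only an example):

org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = myDS
org.quartz.jobStore.misfireThreshold = 60000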

Work around celerybeat being a single point of failure

I'm looking for a recommended solution to work around celerybeat being a single point of failure for a celery/rabbitmq deployment. Searching the web, I haven't found anything that makes sense so far.
In my case, a once-a-day timed scheduler kicks off a series of jobs that could run for half a day or longer. Since there can only be one celerybeat instance, if something happens to it or to the server it's running on, critical jobs will not be run.
I'm hoping there is already a working solution for this, as I can't be the only one who needs reliable (clustered or the like) scheduler. I don't want to resort to some sort of database-backed scheduler, if I don't have to.
There is an open issue about this in the celery GitHub repo. I don't know if they are working on it, though.
As a workaround you could add a lock for tasks so that only one instance of a specific PeriodicTask will run at a time.
Something like:
if not cache.add('My-unique-lock-name', True, timeout=lock_timeout):
    return
Figuring out the lock timeout is, well, tricky. We're using 0.9 * the task's run_every seconds, in case different celerybeats try to run the tasks at different times.
The 0.9 is just to leave some margin (e.g. when celery is a little behind schedule once and then back on schedule, which would cause the lock to still be active).
Then you can run a celerybeat instance on all machines. Each task will be queued by every celerybeat instance, but only one of them will complete the run.
Tasks will still respect run_every this way; in the worst-case scenario, tasks will run at 0.9 * run_every speed.
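Put together, the lock wrapper might look like the sketch below (the task name, lock key and interval are made up for illustration, and it assumes a shared Django cache backend as discussed further down):

from celery import shared_task
from django.core.cache import cache

RUN_EVERY = 300                      # hypothetical schedule interval, in seconds
LOCK_TIMEOUT = int(0.9 * RUN_EVERY)  # the 0.9 margin described above

@shared_task
def my_periodic_task():
    # cache.add is atomic: only the first instance to add the key proceeds;
    # the duplicates queued by the other celerybeat instances return immediately.
    if not cache.add('my-periodic-task-lock', True, timeout=LOCK_TIMEOUT):
        return
    # ... the actual work of the task goes here ...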
One issue with this approach: if tasks were queued but not processed at the scheduled time (for example because the queue processors were unavailable), then the lock may be placed at the wrong time, possibly causing the next task to simply not run. To work around this you would need some kind of detection mechanism for whether a task is more or less on time.
Still, this shouldn't be a common situation in production.
Another solution is to subclass the celerybeat Scheduler and override its tick method. Then, for every tick, acquire a lock before processing tasks. This makes sure that celerybeats with the same periodic tasks won't queue the same tasks multiple times: only one celerybeat per tick (the one that wins the race) will queue tasks. If one celerybeat goes down, another one will win the race on the next tick.
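As a rough illustration of that idea (the class name, lock key and interval are invented, and the exact tick signature may vary between celery versions):

from celery import beat
from django.core.cache import cache

LOCK_SECONDS = 5  # should roughly match the beat tick interval

class LockingScheduler(beat.Scheduler):
    def tick(self, *args, **kwargs):
        # Only the instance that wins the cache lock queues tasks in this
        # window; the others just sleep and retry on their next tick.
        if not cache.add('celerybeat-tick-lock', True, timeout=LOCK_SECONDS):
            return LOCK_SECONDS
        return super(LockingScheduler, self).tick(*args, **kwargs)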
This of course can be used in combination with the first solution.
Of course, for this to work the cache backend needs to be replicated and/or shared across all servers.
It's an old question, but I hope this helps someone.

Select node in Quartz cluster to execute a job

I have some questions about Quartz clustering, specifically about how triggers fire / jobs execute within the cluster.
Does Quartz give any preference to nodes when executing jobs? Such as always (or never) the node that executed the same job the last time, or is it simply whichever node gets to the job first?
Is it possible to specify the node which should execute the job?
The answer to this will be something of a "it depends".
For quartz 1.x, the answer is that the execution of the job is always (only) on a more-or-less random node, where "randomness" is really based on whichever node gets to it first. For "busy" schedulers (where there are always a lot of jobs to run) this ends up giving a pretty balanced load across the cluster nodes. For a non-busy scheduler (only an occasional job to fire) it may sometimes look like a single node is firing all the jobs, because the scheduler looks for the next job to fire whenever a job execution completes, so the node just finishing an execution tends to find the next job to execute.
With quartz 2.0 (which is in beta) the answer is the same as above, for standard quartz. But the Terracotta folks have built an Enterprise Edition of their TerracottaJobStore which offers more complex clustering control - as you schedule jobs you can specify which nodes of the cluster are valid for the execution of the job, or you can specify node characteristics/requisites, such as "a node with at least 100 MB RAM available". This also works along with ehcache, such that you can specify the job to run "on the node where the data keyed by X is local".
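(For completeness: in standard Quartz, clustering itself is enabled through the JDBC job store, roughly along these lines in quartz.properties; the values shown are only illustrative.)

org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000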
I solved this question for my web application using Spring + AOP + memcached. My jobs know, from the data they traverse, whether the job has already been executed, so the only thing I need to avoid is two or more nodes running at the same time.
You can read it here:
http://blog.oio.de/2013/07/03/cluster-job-synchronization-with-spring-aop-and-memcached/