We are using quartz 2.1.5, on 64 bit machine (clustered, 2 instances, 16GB ram). We have around 8000 triggers in the system.
Every second we have around 50 triggers - they get fired every second.
org.quartz.threadPool.threadCount = 50
org.quartz.scheduler.batchTriggerAcquisitionMaxCount=100
org.quartz.scheduler.idleWaitTime=15000
#org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=0 (this is not set)
Quartz is able to handle the load, but triggers get fired ahead of time?
batchTriggerAcquisitionMaxCount - can we increase it to 500 and keep batchTriggerAcquisitionFireAheadTimeWindow at 1000 (1 sec), is there any disadvantage of these configuration?
Any other way?
with following configuration, it seems to work fine.
org.quartz.threadPool.threadCount = 100
org.quartz.scheduler.batchTriggerAcquisitionMaxCount=500
org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=1000
org.quartz.scheduler.idleWaitTime=25000
When quartz wants to run your triggers it calls this method :
triggers = qsRsrcs.getJobStore().acquireNextTriggers(now + idleWaitTime, Math.min(availThreadCount, qsRsrcs.getMaxBatchSize()), qsRsrcs.getBatchTimeWindow());
Where :
idleWaitTime is org.quartz.scheduler.idleWaitTime
availThreadCount is the number of free threads (will be less or equal to org.quartz.threadPool.threadCount)
qsRsrcs.getMaxBatchSize() is org.quartz.scheduler.batchTriggerAcquisitionMaxCount
qsRsrcs.getBatchTimeWindow() is org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow
it leads to an SQL request like :
SELECT * FROM TRIGGERS WHERE NEXT_FIRE_TIME <= now + idleWaitTime + qsRsrcs.getBatchTimeWindow() LIMIT Math.min(availThreadCount, qsRsrcs.getMaxBatchSize())
So yes, Quartz always runs triggers ahead of time with idleWaitTime + qsRsrcs.getBatchTimeWindow(). The minimum number of triggers that it takes will be if you set getBatchTimeWindow to zero, and idleWaitTime to 1000 (it's a minimal value). In this case it will still take triggers that are supposed to happen 1 second ahead of time in addition to those, which are expected to run.
If you want to stop taking triggers ahead of time completely you can set batchTriggerAcquisitionMaxCount to 1. The downside in this case is that you can get too many SQL requests. You can try to play with batchTriggerAcquisitionMaxCount parameter and find the value that suits you best.
BTW Looking on the Quartz code you can see that setting batchTriggerAcquisitionMaxCount bigger than threadCount makes no sense.
Related
I have the following problem which I am unable to solve:
I have a situation where a security point (added as delay) holds every half an hour a 15 min break. After the break, the security guards increase their speed till the queue is shorter than 10pp.
I wanted to model this as follows: a state chart with delay.set_capacity(0) after 30 minutes and delay.set_capacity(1) again after the 15 min break. For the increased speed after the break, I added an additional state with condition: queue.size()>10 and now I want to set the action such that the delay function changes the delay time from exponential (1/10) to exponential (1/5) as long as queue.size()>10.
Anyone experience with which function in the action box to use? Or would you suggest a different function?
Since you are using, or at least want to use a statechart I would suggest the following design, where you have composite states inside the working state to indicate if the security agent is working fast or normal and a message transition to let it move from one state to the next.
It is advised to use a message transition and trigger it as needed instead of a conditional state which gets chected for every change inside the agent since this can be a computational expensive exercise.
I assume you already implemented the correct capacity settings for the different on enter actions for working and breaking
Now you simply need to send the message every time an agent enters the queue and every time it exits the delay block, and of course, see the delay time based on the state of the statechart.
Aee screenshot below.
Actually what I want to do,
I created dashboards to monitor the alert status in grafana.
I created fake data in my system to simulate my alert situations on these boards. The time of this data covers the range now - now + 12h. In fact, it takes a long time to analyze the alert status in real data. For this reason, I cannot be very flexible on my alert rules. I have to wait until the end of this period to see alert status in the system. (I have many states like this actually.) Grafana creates pending, alerting, and ok states according to the records in my database. Is there a method to quickly verify my tests without waiting for this time?
The main problem is that it is fairly expensive to do in a data source agnostic way. The way worked in Bosun is you would select a time range, and then an interval or a number of queries to run.
Setting both From and To enables testing multiple iterations of the selected alert over time. The number of iterations depends on the setting to the two linked fields Intervals and Step Duration at 3 Changing one changes the other. Intervals will be the number of runs to do even spaced out over the duration of From to To and Step Duration is how much time in minutes should be between intervals. Doing a test over time will populate the Timeline tab 5 which draws a clickable graphic of severity states for each item in the set:
It would then run all those queries with a pool limiting simultaneous queries. For an interval of say 5 minutes, it would run adjacent 5 minute queries.
So this would speed up the alert authoring and testing workflow significantly. But it would best be implemented as a job system. This is because with more expensive queries, or range/interval combination that is a fair amount of runs, it may take a minute or so - so having to wait on an open network connection is less ideal.
So I found I generally used in two modes:
To tweak a specific alert that had fired at some time
To get a general overview of how much the alert rule would trigger for the historical data
For the general over, a larger time range is generally wanted, which means more queries if the interval is kept the same. And with a feature like FOR (Pending), you would have to use the same interval it would actually run at.
So possible, has some limitations, and some care needs to be taken to do it right. But extremely useful in my experience.
I'm trying to simulate pallet behavior by using batch and move to. This works fine except towards the end where the number of elements left is smaller than the batch size, and these never get picked up. Any way out of this situation?
Have tried messing with custom queues, pickup/dropoff pairs.
To elaborate, the batch object has a queue size of 15. However once the entire set has been processed a number of elements less than 15 remain which don't get picked up by the subsequent moveTo block. I need to send the agents to the subsequent block once the queue size falls below 15.
You can dynamically change the batch size of your Batch object towards "the end" (whatever you mean by that :-) ). You need to figure out when to change the batch size (as this depends on your model). But once it is time to adjust, you can call myBatchItem.set_batchSize(1) and it will now batch things together individually.
However, a better model design might be to have a cool-down period before the model end, i.e. stop taking model measurements before your batch objects run out of agents to batch.
You need to know what the last element is somehow for example using a boolean variable called isLast in your agent that is true for the last agent.
and in the batch you have to change the batch size programatically.. maybe like this in the on enter action of your batch:
if(agent.isLast)
self.set_batchSize(self.size());
To determine if the "end" or any lack of supply is reached, I suggest a timeout. I would save a timestamp in a variable lastBatchDate in the OnExit code of the batch block:
lastBatchDate = date();
A cyclically activated event checkForLeftovers will check every once in a while if there is objects waiting to be batched and the timeout (here: 10 minutes) is reached. In this case, the batch size will be reduced to exactly the number of waiting objects, in order for them to continue in a smaller batch:
if( lastBatchDate!=null //prevent a NullPointerError when date is not yet filled
&& ((date().getTime()-lastBatchDate.getTime())/1000)>600 //more than 600 seconds since last batch
&& batch.size()>0 //something is waiting
&& batch.size()<BATCH_SIZE //not more then a normal batch is waiting
){
batch.set_batchSize(batch.size()); // set the batch size to exactly the amount waiting
}
else{
batch.set_batchSize(BATCH_SIZE); // reset the batch size to the default value BATCH_SIZE
}
The model will look something like this:
However, as Benjamin already noted, you should be careful if this is what you really need to model. Take care for example on these aspects:
Is the timeout long enough to not accidentally push smaller batches during normal operations?
Is the timeout short enough to have any effect?
Is it ok to have a batch of a smaller size downstream in your process?
etc.
You might just want to make sure upstream that the number of objects reaches the batching station are always filling full batches, or you might to just stop your simulation before the line "runs dry".
You can see the model and download the source code here.
The following discussion is in the context of Apache Flink:
Imagine that we have a keyedStream whose key is its id and event time is its timestamp, if we want to calculate how many events arrived within 10 minutes for each event.
The problems need to be solved are:
How to design the window ?
We can create a window of 10 minutes after each event arrives, but this mean that for each event, there will be a delay of 10 minutes because the wait for the window of 10 minutes.
We can create a window of 10 minutes which takes the timestamp of each event as the maximum timestamp in this window, which means that we don't need to wait for 10 minutes, because we take the last 10 minutes of elements before the element arrives. But this kind of window is not easy to define, as far as I know.
How to deal with memory or other resource issues ? Even we succeed to create a window, maybe the kind of ids of events are diverse, so many window like this, how the system keep their states in the memory ? There is a big possibility of stakoverflow of memory.
Maybe there are some problems that I don't mention here, or maybe there are some good solutions except window(i.e. Patterns). If you have a good solutions, please give me a clue, thank you.
You could do this with a GlobalWindow and a Trigger than fires on every event and an Evictor that removes events that are more than 10 minutes old before counting the remaining events. (A naive implementation could easily perform very poorly, however.)
Yes, this may require keeping a lot of state -- you'll be keeping every event from the past 10 minutes (well, you only need to store the timestamp from each event). If you setup the RocksDB state backend then Flink will spill to disk if need be, but with some obvious performance penalty. Probably better to use a cluster large enough to hold 10 minutes of traffic in memory. Even at one million events per second, each with a 32-bit timestamp, that's only 2.4GB in 10 minutes (1 million events per second x 600 seconds x 4 bytes per event) -- doesn't seem like a problem at all.
I run < 24 checks on my systems. Servers are not regularly heavily loaded. Load averages keep well under 1 during normal operation.
I have noticed a re-occurring issue where the check-cpu check would start triggering high load averages on systems where there was no organic cause for high load. Further investigation showed the high load report was actually due to the check-cpu script running in parallel with other checks. Outside of the checks executing, cpu load was fine.
I upgraded from sensu 0.20 to 0.23 and continued to observe the same issue.
We found that a re-start of the sensu-server and sensu-client services would resolve the problem for a period of time (approximately 24 hours) and then it would return.
We theorized at this point, there must be some sort of time-delay in the dispatch / execution of the checks on the host which causes this overlap to eventually occur.
All checks are set to run at an interval of 30 or 60.
I decided to set the interval of the check-cpu check to 83, and the issue has not occurred since. Presumably because the check-cpu check does not coincide with any others, thus not seeing high cpu load during that short moment.
Is this some sort of inherent scheduling issue with sensu? Is it supposed to know how to dispatch checks with adequate spacing, or is this something that should be controlled by the interval parameter?
Thanks!
I have noticed that the checks drift in execution time. i.e they do not run exactly every 30 seconds but every 30.001s or something like that. I guess the drift might be different on different checks. So eventually you will run into the problem that the checks sync up and all run at the same time, causing the problem. Running more checks at regular intervals (30s, 60s etc) will make this problem occur more often. If you want a change to this problem you have to report it to sensu directly. I think they might fix it eventually since they probably want the system to be scalable.