Quartz.net tracking and misfiring

I have a few questions regarding Quartz.net.
What is it that keeps track of whether there has been a misfire situation in Quartz.net?
What happens in the following scenarios:
If a job is run but cannot finish due to some bug, does that count as a misfire or not?
If I republish the solution, is the tracking reset?
Is there a way to get information on what the scheduler has done and has not been able to do?
I have the following code in my Run method:
IJobDetail dailyUserMailJob = new JobDetailImpl("DailyUserMailJob", null, typeof(Jobs.TestJob));
ITrigger trigger = TriggerBuilder.Create()
    .WithIdentity("trigger1", "group1")
    .WithCronSchedule("0 0 4 1 * ?", x => x.WithMisfireHandlingInstructionFireAndProceed())
    .Build();
this.Scheduler.ScheduleJob(dailyUserMailJob, trigger);
this.Scheduler.Start();
The job is supposed to run on the first of every month at 4 am.
When testing, I set the system clock so that the job is missed for one month. According to the documentation, when using WithMisfireHandlingInstructionFireAndProceed the job should be run as the first thing that happens, but it doesn't. Is there something wrong with the code, or could there be some other reason the job is not run when using WithMisfireHandlingInstructionFireAndProceed()?

If a job is missed, there is logic to bring it back. However, there is a "window" on how far back to go.
<add key="quartz.jobStore.misfireThreshold" value="60000"/>
You can increase this value.
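If you would rather set this in code than in app.config, here is a minimal sketch (assuming the standard StdSchedulerFactory and the synchronous Quartz.NET 2.x API used in the question; the property name is the same key shown above):

using System.Collections.Specialized;
using Quartz;
using Quartz.Impl;

// Sketch: widen the misfire "window" from the default 60000 ms to 15 minutes.
var properties = new NameValueCollection();
properties["quartz.jobStore.misfireThreshold"] = "900000"; // milliseconds

ISchedulerFactory factory = new StdSchedulerFactory(properties);
IScheduler scheduler = factory.GetScheduler(); // note: returns Task<IScheduler> in Quartz.NET 3.x
scheduler.Start();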
If you have an AdoStore, misfires are persisted. Thus, if the power goes out, you can recover from misfires when restarting.
If you have a RamStore and the power goes out, you won't get misfire handling, because everything was in memory to begin with and that memory is lost.
If you use Sql Server (AdoStore) and put a Profiler/Trace on it, you'll see the engine "poll" for misfires, with a "go back this far in time" window based on the misfireThreshold.
See this link for more detailed info; it also has a note on withMisfireHandlingInstructionFireAndProceed:
http://nurkiewicz.blogspot.com/2012/04/quartz-scheduler-misfire-instructions.html

Related

How to debug this dropped record further in Apache Beam?

I am seeing intermittently dropped records (only for error messages, though, not for success ones). We have a test case that intermittently fails/passes because of a lost record. We are using "org.apache.beam.sdk.testing.TestPipeline.java" in the test case. This is the relevant setup code where I have tracked the dropped record to...
PCollectionTuple processed = records
    .apply("Process RosterRecord", ParDo.of(new ProcessRosterRecordFn(factory))
        .withOutputTags(TupleTags.OUTPUT_INTEGER, TupleTagList.of(TupleTags.FAILURE))
    );
errors = errors.and(processed.get(TupleTags.FAILURE));
PCollection<OrderlyBeamDto<Integer>> validCounts = processed.get(TupleTags.OUTPUT_INTEGER);
PCollection<OrderlyBeamDto<Integer>> errorCounts = errors
    .apply("Flatten Roster File Error Count", Flatten.pCollections())
    .apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));
The relevant code in ProcessRosterRecordFn.java is this:
if (dto.hasValidationErrors()) {
    RosterIngestError error = new RosterIngestError(record.getRowNumber(), record.toTitleValue());
    error.getValidationErrors().addAll(dto.getValidationErrors());
    error.getOldValidationErrors().addAll(dto.getOldValidationErrors());
    log.info("Tagging record row number=" + record.getRowNumber());
    c.output(TupleTags.FAILURE, new OrderlyBeamDto<>(error));
    return;
}
I see this "Tagging record row number" log for both of the 2 rows that fail. After that, however, in the first line of ErrorPublisherFn.java we log immediately after receiving each message, and SOMETIMES we only receive 1 of the 2 rows. When we receive both, the test passes. The test is very flaky in this regard.
Apache Beam is really annoying in its naming of threads (they all have the same name), so I added a logback thread hashcode to get more insight, but I don't see anything there, and the ErrorPublisherFn could publish #4 on any thread anyway.
Ok, so now the big question: how do I insert more things to figure out why this record is being dropped INTERMITTENTLY?
Do I have to debug Apache Beam itself? Can I insert other functions or make changes to figure out why this error is 'sometimes' lost on some test runs and not others?
EDIT: Thankfully, this set of tests is not testing errors from upstream, so the line errors = errors.and(processed.get(TupleTags.FAILURE)); can be removed, which forces me to also remove .apply("Flatten Roster File Error Count", Flatten.pCollections()). After removing those 2 lines, the issue goes away for 10 test runs in a row (i.e. I can't completely say it is gone, given the flakiness). Are we doing something wrong in the join and flattening? I checked the error structure, and rowNumber is part of equals and hashCode, so there should be no duplicates, and I'm not sure why it would fail intermittently if there were duplicate objects either.
What more can be done to debug here and figure out why this join is not working in the TestPipeline?
How can I get insight into the flatten and join so I can debug why we are losing an event, and why we only lose it 'sometimes'?
Is this a windowing issue, even though our job started with a file to read in and we want to process that file? We wanted a constant Dataflow stream available, as Google kept running into limits, but perhaps this was the wrong decision?

(Laravel 5) Monitor and optionally cancel an ALREADY RUNNING job on queue

I need to achieve the ability to monitor, and be able to cancel, an ALREADY RUNNING queued job.
There are a lot of answers about deleting QUEUED jobs, but none about an already running one.
This is the situation: I have a "job" which consists of HUNDREDS OF THOUSANDS of rows in a database that need to be queried ONE BY ONE against a web service.
Every row needs to be picked up, queried against the web service, the response stored, and its status updated.
I already had that working as a Command (launched from / outputting to the console), but now I need to implement queues in order to allow more jobs from more users to pile up.
So far I've looked at Horizon (which doesn't run on Windows due to missing process control libs). However, from the demos I've seen, it lacks (I believe) a couple of things I need:
Dynamically configurable timeout (the whole job may take more than 12 hours, depending on the number of rows to process on the selected job)
Ability to CANCEL an ALREADY RUNNING job.
I also considered the option of generating EACH REQUEST as a new job instead of treating the whole collection of rows as one "job" (this would overcome the timeout issue), but that would give me a Horizon "pending jobs" list of hundreds of thousands of records per job, and that would kill the browser (I know Redis can handle this without itching at all). Further, I guess it's not possible to cancel "all jobs belonging to tag X".
I've been thinking about hitting an API route, firing the job and decoupling it from the app, but it seems that this requires forking processes.
For the ability to cancel, I would add a database table keyed by job_id, and when the user hits an API endpoint to cancel a job, I'd mark it as "halted". On every loop the job would check its status, and if it finds "halted", kill itself.
If I've missed any aspect, just holler and I'll add it or clarify it.
So I'm asking for advice here since I'm new to Laravel: how could I achieve this?
So I finally came up with this (a bit clunky) solution:
In the controller:
public function cancelJob()
{
    $jobs = DB::table('jobs')->get();
    # I could use a specific ID, filter by owner user, etc.
    foreach ($jobs as $job) {
        DB::table('jobs')->delete($job->id);
    }
    # This is a file that... well, it's self-explanatory
    touch(base_path(config('files.halt_process_signal')));
    return "Job cancelled - It will stop soon";
}
In the job class (inside the Model::chunk() callback):
# CHECK FOR HALT SIGNAL AND [OPTIONALLY] STOP THE PROCESS
if ($this->service->shouldHaltProcess()) {
    # build stats, do some cleanup, log, etc...
    $this->halted = true;
    $this->service->stopProcess();
    # Returning FALSE is what makes the chunk() method stop looping
    return false;
}
In the service class:
/**
 * Checks the existence of the 'Halt Process Signal' file
 *
 * @return bool
 */
public function shouldHaltProcess() : bool
{
    return file_exists($this->config['files.halt_process_signal']);
}

/**
 * Stop the batch process
 *
 * @return void
 */
public function stopProcess() : void
{
    logger()->info("=== HALT PROCESS SIGNAL FOUND - STOPPING THE PROCESS ===");
    $this->deleteHaltProcessSignalFile();
    return;
}
It doesn't look very elegant, but it works.
I've surfed the whole web, and many answers go for Horizon or other tools that don't fit my case.
If anyone has a better way to achieve this, you're welcome to share it.
Laravel queues have 3 important config options:
1. retry_after
2. timeout
3. tries
See more: https://laravel.com/docs/5.8/queues
Dynamically configurable timeout (the whole job may take more than 12 hours, depending on the number of rows to process on the selected job)
I think you can configure timeout + retry_after to be about 24h.
Ability to CANCEL an ALREADY RUNNING job.
Delete the job from the jobs table.
Kill the process by its process id on your server.
Hope it helps you :)

Trouble with agent state chart

I'm trying to create an agent statechart where a transition should happen every day at 4 pm (except weekends).
I have already tried:
1. A conditional transition (condition: getHourOfDay() == 16)
2. A timeout transition that will "reinsert" my agent into the chart every 1 s and check if the hour is 16.
My code is still not working; does anyone have any idea how to solve it?
This is my statechart view. Customer is a single resource that is supposed to "get" the products out of my stock every day at 4 pm. This is supposed to happen in the "Active" state.
I have set a timeout transition (from Active to Active) that runs every 1 s.
Inside my "Active" state, in the "entrance action", I wrote my code to check whether it is 4 pm and, if so, run my action.
I thought that since I set a timeout transition it would check my condition every 1 s, but apparently it is not working.
Your agent does not enter the Active state via an internal transition.
Redraw the transition so that it actually goes out of the Active state and then enters it again, as below:
Don't use condition-based transitions, for performance reasons. In your case, it also never triggers because the condition is only evaluated when something happens in the model, and incidentally, nothing happens at exactly 4 pm.
Re your timeout approach: why would you "reinsert" your agent into its own statechart? Not sure I understand...
Why not set up a schedule or event with your recurrence requirement and make it send a message to the statechart: stateChart.fireEvent("trigger!");. In your statechart, add a message-based transition that waits for this message. This will work.
Be careful to understand the difference between the Statechart.fireEvent() and the Statechart.receiveMessage() functions, though.
PS: I agree with Felipe: please start using SO the way it is meant to be used, i.e. also mark replies as solved. It helps us, but also future users, to quickly find solutions :-) cheers

Moving from a file-based tracing session to a real-time session

I need to log trace events during boot, so I configure an AutoLogger with all the required providers. But when my service/process starts, I want to switch to real-time mode so that the file doesn't explode in size.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I could transfer the session to real-time mode. But instead, the file I got contained events from a 15 s period instead of just 10 s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation, "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code alright (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting processing the events)?
I could close the session and create a new, different one, but then I think I'd miss some events. Or I could open a new session and then close the file-based one, but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the question of the 'auto-closing and restarting' feature, these are really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' when you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but frankly, expecting unusual combinations to just work is generally NOT realistic.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all, they have IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (switching from one file to another). This certainly solves your problem of file size growth, and is often a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).
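To illustrate the "open a new session and close the original one" approach, here is a rough sketch (not the author's code) using the TraceEvent library; the real-time session name and the lastFileEventTimestamp variable are placeholders, and the provider call is the same one used in the question:

using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Session;

// Placeholder: timestamp of the last event already read from the ETL file,
// used to drop duplicates that were captured by both sessions.
DateTime lastFileEventTimestamp = DateTime.MinValue;

// 1. Start a new real-time session (no file name means real-time mode).
using (var realTime = new TraceEventSession("REALTIMESESSIONNAME"))
{
    realTime.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());

    // 2. Stop the original file-based session created by the AutoLogger.
    using (var fileBased = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
    {
        fileBased.Stop();
    }

    // 3. Process real-time events, skipping anything already captured in the file.
    realTime.Source.Dynamic.All += data =>
    {
        if (data.TimeStamp <= lastFileEventTimestamp)
            return; // duplicate of an event already in the ETL file
        Console.WriteLine($"{data.TimeStamp:O} {data.EventName}");
    };
    realTime.Source.Process(); // blocks until the session is stopped
}

The new session is started before the old one is stopped so that nothing is missed in between; the timestamp check then removes the overlap, as suggested above.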

How does Solaris SMF determine if something is to be in maintenance or to be restarted?

I have a daemon process that I wrote which is executed by SMF. The problem is that when an error occurs, I have fail code and then it needs to restart from scratch. Right now it is calling sys.exit(0) (Python), but SMF keeps throwing it into maintenance mode.
I've worked with SMF enough to know that it sometimes auto-restarts certain services (and lets others fail, leaving you to deal with them like this). How do I classify this process as one that needs to auto-restart? Is it an SMF setting, a method of failing, what?
Manpage
Solaris uses a combination of startd/critical_failure_count and startd/critical_failure_period as described in the svc.startd manpage:
startd/critical_failure_count
startd/critical_failure_period
The critical_failure_count and critical_failure_period properties together specify the maximum number of service failures allowed in a given time interval before svc.startd transitions the service to maintenance. If the number of failures exceeds critical_failure_count in any period of critical_failure_period seconds, svc.startd will transition the service to maintenance.
Defaults in the source code
The defaults can be found in the source; the value depends on whether the service is "wait style":
if (instance_is_wait_style(inst))
    critical_failure_period = RINST_WT_SVC_FAILURE_RATE_NS;
else
    critical_failure_period = RINST_FAILURE_RATE_NS;
The defaults are either 5 failures/10 minutes or 5 failures/second:
#define RINST_START_TIMES 5 /* failures to consider */
#define RINST_FAILURE_RATE_NS 600000000000LL /* 1 failure/10 minutes */
#define RINST_WT_SVC_FAILURE_RATE_NS NANOSEC /* 1 failure/second */
These variables can be set in the SMF as properties:
<service_bundle type="manifest" name="npm2es">
  <service name="site/npm2es" type="service" version="1">
    ...
    <property_group name="startd" type="framework">
      <propval name='critical_failure_count' type='integer' value='10'/>
      <propval name='critical_failure_period' type='integer' value='30'/>
      <propval name="ignore_error" type="astring" value="core,signal" />
    </property_group>
    ...
  </service>
</service_bundle>
TL;DR
After checking against the startd values: if the service is "wait style", it will be throttled to a maximum restart rate of 1/sec until it no longer exits with a non-cfg error. If the service is not "wait style", it will be put into maintenance mode.
Presuming a normal service manifest, I would suspect that you're dropping into maintenance because SMF is restarting you "too quickly" (which is a bit arbitrarily defined). svcs -xv should tell you if that is the case. If it is, SMF is restarting you, you're exiting again rapidly, and it has decided to give up until the problem is fixed (and you've manually svcadm clear'd it).
I'd wondered whether exiting 0 (and indicating success) might cause further confusion, but it doesn't appear that it will.
I don't think Oracle Solaris allows you to tune what SMF considers "too quickly".
You have to create a service manifest. This is more complicated than not. The following document has example manifests and documents the manifest structure:
http://www.oracle.com/technetwork/server-storage/solaris/solaris-smf-manifest-wp-167902.pdf
As it turns out, I had two pkills in a row to make sure everything was terminated correctly. The second one, naturally, was exiting with something other than 0. Changing the script to end with an exit 0 solved the problem.