How does Solaris SMF determine if something is to be in maintenance or to be restarted? - solaris

I have a daemon process that I wrote, which is run by SMF. When an error occurs, the daemon runs its failure code and then needs to be restarted from scratch. Right now it exits with sys.exit(0) (Python), but SMF keeps putting it into maintenance mode.
I've worked with SMF enough to know that it sometimes auto-restarts certain services (and lets others fail and have you deal with them like this). How do I classify this process as one that needs to auto-restart? Is it an SMF setting, a method of failing, what?

Manpage
Solaris uses a combination of startd/critical_failure_count and startd/critical_failure_period as described in the svc.startd manpage:
startd/critical_failure_count
startd/critical_failure_period
The critical_failure_count and critical_failure_period properties together specify the maximum number of service failures allowed in a given time interval before svc.startd transitions the service to maintenance. If the number of failures exceeds critical_failure_count in any period of critical_failure_period seconds, svc.startd will transition the service to maintenance.
Defaults in the source code
The defaults can be found in the source; the value depends on whether the service is "wait style":
if (instance_is_wait_style(inst))
    critical_failure_period = RINST_WT_SVC_FAILURE_RATE_NS;
else
    critical_failure_period = RINST_FAILURE_RATE_NS;
The defaults are either 5 failures/10 minutes or 5 failures/second:
#define RINST_START_TIMES 5 /* failures to consider */
#define RINST_FAILURE_RATE_NS 600000000000LL /* 1 failure/10 minutes */
#define RINST_WT_SVC_FAILURE_RATE_NS NANOSEC /* 1 failure/second */
These values can be overridden as properties in the service's SMF manifest:
<service_bundle type="manifest" name="npm2es">
  <service name="site/npm2es" type="service" version="1">
    ...
    <property_group name="startd" type="framework">
      <propval name="critical_failure_count" type="integer" value="10"/>
      <propval name="critical_failure_period" type="integer" value="30"/>
      <propval name="ignore_error" type="astring" value="core,signal"/>
    </property_group>
    ...
  </service>
</service_bundle>
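A minimal sketch of the rule quoted from the manpage, written in Python purely as an illustration. It models only the manpage wording (more than critical_failure_count failures inside a window of critical_failure_period seconds leads to maintenance). It is not svc.startd's actual implementation, and the function name and the timestamps are made up for the example.

```python
from collections import deque


def exceeds_failure_rate(failure_times, failure_count, failure_period):
    """Return True if more than `failure_count` failures fall inside
    any window of `failure_period` seconds.

    `failure_times` is an iterable of failure timestamps in seconds.
    This models the manpage text only, not svc.startd's real code.
    """
    window = deque()
    for t in sorted(failure_times):
        window.append(t)
        # Drop failures that are a full period (or more) older than this one.
        while t - window[0] >= failure_period:
            window.popleft()
        if len(window) > failure_count:
            return True
    return False


# With the manifest values above (count=10, period=30 seconds):
# eleven failures one second apart exceed the limit ...
print(exceeds_failure_rate(range(11), 10, 30))
# ... while eleven failures ten seconds apart do not.
print(exceeds_failure_rate([i * 10 for i in range(11)], 10, 30))
```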
TL;DR
After checking against the startd values: if the service is "wait style", restarts are throttled to at most one per second until it no longer exits with a non-configuration error. If the service is not "wait style", it is put into maintenance mode.

Presuming a normal service manifest, I would suspect that you're dropping into maintenance because SMF is restarting you "too quickly" (which is a bit arbitrarily defined). svcs -xv should tell you if that is the case. If it is, SMF is restarting you, you're exiting again rapidly, and it has decided to give up until the problem is fixed (and you've manually svcadm clear'd it).
I'd wondered if exiting 0 (and indicating success) may cause further confusion, but it doesn't appear that it will.
I don't think Oracle Solaris allows you to tune what SMF considers "too quickly".

You have to create a service manifest, which is more involved than it sounds. The following white paper has example manifests and documents the manifest structure:
http://www.oracle.com/technetwork/server-storage/solaris/solaris-smf-manifest-wp-167902.pdf

As it turns out, I had two pkills in a row to make sure everything was terminated correctly. The second one, naturally, was exiting with a status other than 0, so the script itself reported failure. Adding an exit 0 at the end of the script solved the problem.
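The fix above can be sketched in Python as follows. This is only an illustration under assumptions: the original script was presumably a shell script, and the function name, the process pattern and the injectable `run` parameter are hypothetical. The point it shows is that pkill exits with status 1 when no process matched, so a stop method should not pass that status on as its own result.

```python
import subprocess

SMF_EXIT_OK = 0  # exit status SMF treats as success


def stop_method(patterns, run=subprocess.call):
    """Kill every process matching each pattern, then report success.

    `run` is injectable so the function can be exercised without
    actually calling pkill; by default it is subprocess.call.
    """
    for pattern in patterns:
        status = run(["pkill", pattern])
        # pkill returns 1 when nothing matched, which is expected for
        # the second, "just to be sure" invocation. Ignore it.
        if status not in (0, 1):
            print("pkill %s returned %d" % (pattern, status))
    return SMF_EXIT_OK


if __name__ == "__main__":
    # "mydaemon" is a placeholder process name.
    raise SystemExit(stop_method(["mydaemon", "mydaemon"]))
```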

Related

How to make afl-fuzz not skip test cases when a timeout is reached

I am currently trying to fuzz a PDF viewer with the AFL fuzzer (American Fuzzy Lop).
My problem is quite simple: afl-fuzz expects the application to take an input and exit after processing it. The PDF viewer, however, is designed to open the document and stay open until closed. As a result, afl-fuzz hits the timeout for all initial inputs and gives up.
...
[*] Validating target binary...
[*] Attempting dry run with 'id:000000,orig:myPDFsample00.pdf'...
[*] Spinning up the fork server...
[+] All right - fork server is up.
[!] WARNING: Test case results in a timeout (skipping)
[*] Attempting dry run with 'id:000001,orig:myPDFsample01.pdf'...
[!] WARNING: Test case results in a timeout (skipping)
[*] Attempting dry run with 'id:000002,orig:myPDFsample02.pdf'...
[-] PROGRAM ABORT : All test cases time out, giving up!
Location : perform_dry_run(), afl-fuzz.c:2883
I would like to know how to tell AFL that reaching the timeout and having the program terminated is "normal" behavior for the test case.
In fact, the usual approach seems to be to simply instrument the code of the software you are looking at by adding an exit(0) after the parsing.
It seems quite basic, but it works...
The other way would be to change the meaning of a timeout in AFL itself. But then it won't detect 'hangs' when your software enters a never-ending loop.
So the best way really seems to be adding an exit(0) (or return 0 if you are in main()) to your target software just after the parsing is done.

Moving from file-based tracing session to real time session

I need to log trace events during boot so I configure an AutoLogger with all the required providers. But when my service/process starts I want to switch to real-time mode so that the file doesn't explode.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I can transfer the session to real-time mode. But instead, the file I got contained events from a 15s period instead of just 10s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code alright (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting processing the events)?
I could close the session and create a new different one but then I think I'd miss some events. Or I could open a new session and then close the file-based one but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the exception of the 'auto-closing and restarting' feature, it is really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' if you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but frankly, expecting unusual combinations to just work is generally NOT a safe assumption.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all they are IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (from one file to another). This certainly solves your problem of file size growth, and often is a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).
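The "open a new session, close the old one, filter out duplicates" recommendation can be sketched as below. This is a Python illustration, not TraceEvent code: the event dictionaries and their field names (timestamp, provider, event_id, pid) are assumptions made for the example. Since two distinct events can share a timestamp, the sketch keys on a few more fields than the timestamp alone.

```python
def merge_sessions(file_events, realtime_events):
    """Merge events from two overlapping sessions, dropping duplicates.

    Each event is a dict with hypothetical keys: "timestamp",
    "provider", "event_id" and "pid". An event captured by both
    sessions carries an identical timestamp, so it is kept only once.
    """
    seen = set()
    merged = []
    for event in list(file_events) + list(realtime_events):
        key = (event["timestamp"], event["provider"],
               event["event_id"], event["pid"])
        if key in seen:
            continue  # already delivered by the other session
        seen.add(key)
        merged.append(event)
    merged.sort(key=lambda e: e["timestamp"])
    return merged


# The event at timestamp 200 was seen by both sessions during the overlap.
from_file = [
    {"timestamp": 100, "provider": "KernelProcess", "event_id": 1, "pid": 4},
    {"timestamp": 200, "provider": "KernelProcess", "event_id": 2, "pid": 8},
]
from_realtime = [
    {"timestamp": 200, "provider": "KernelProcess", "event_id": 2, "pid": 8},
    {"timestamp": 300, "provider": "KernelProcess", "event_id": 1, "pid": 9},
]
print([e["timestamp"] for e in merge_sessions(from_file, from_realtime)])
```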

Quartz.net tracking and misfiring

I have a few questions regarding quartz.net.
What is it that keeps track of whether there has been a misfire situation in Quartz.net?
What happens in the following scenarios:
If a job is run but cannot finish due to some bug, does that count as a misfire or not?
What happens if I republish the solution, is the tracking reset?
Is there a way to receive information on what the scheduler has done and not been able to do?
I have the following code in my Run method:
IJobDetail dailyUserMailJob = new JobDetailImpl("DailyUserMailJob", null, typeof(Jobs.TestJob));
ITrigger trigger = TriggerBuilder.Create()
    .WithIdentity("trigger1", "group1")
    .WithCronSchedule("0 0 4 1 * ?", x => x.WithMisfireHandlingInstructionFireAndProceed())
    .Build();
this.Scheduler.ScheduleJob(dailyUserMailJob, trigger);
this.Scheduler.Start();
The job is supposed to run on the first of every month at 4 am.
When testing, I set the system clock so that the job is missed for one month. According to the documentation, when using WithMisfireHandlingInstructionFireAndProceed the job should be run as the first thing that happens, but it isn't. Is there something wrong with the code, or could there be some other reason the job is not run when using WithMisfireHandlingInstructionFireAndProceed()?
If a job is missed, there is logic to bring it back. However, there is a "window" on how far back to go.
<add key="quartz.jobStore.misfireThreshold" value="60000"/>
You can increase this value.
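As I read the Quartz documentation, misfireThreshold is the number of milliseconds a trigger may be late before it is considered misfired (60000 by default). The following Python sketch models only that comparison. It is not Quartz.NET code, and the function name and timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def is_misfire(scheduled_fire_time, now, misfire_threshold_ms=60000):
    """Return True if the trigger is late by more than the threshold.

    Simplified model: a trigger that is late by less than the
    threshold simply fires late; only beyond the threshold does the
    misfire instruction (e.g. "fire and proceed") come into play.
    """
    lateness = now - scheduled_fire_time
    return lateness > timedelta(milliseconds=misfire_threshold_ms)


scheduled = datetime(2024, 1, 1, 4, 0, 0)
# 30 seconds late: within the default 60-second threshold.
print(is_misfire(scheduled, datetime(2024, 1, 1, 4, 0, 30)))
# A month late: well beyond the threshold.
print(is_misfire(scheduled, datetime(2024, 2, 1, 4, 0, 0)))
```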
If you use an ADO job store (AdoJobStore), misfires are persisted. Thus if "the power goes out", you can recover from misfires when restarting.
If you use a RAM job store (RAMJobStore) and "the power goes out", everything was in memory to begin with, so that state is lost and you won't get misfire handling.
If you use SQL Server (AdoJobStore) and put a Profiler/Trace on it, you'll see the engine "poll" for misfires, with a "go back this far in time" based on the misfireThreshold.
See this link for more detailed info; it includes a note on "withMisfireHandlingInstructionFireAndProceed":
http://nurkiewicz.blogspot.com/2012/04/quartz-scheduler-misfire-instructions.html

Quickfix reset sequence number at start time but not set ResetSeqNum in Logon message

When the quickfix initiator reconnects at startTime (defined in config) it deletes the files with sequence number, but does not set ResetSeqNumFlag to Y, and the server replies with a Logout message with text "seq msg number to low ..."
Is there a way to set ResetSeqNumFlag = Y only for this behavior? I don't want to reset the sequence on every log-on.
This appears to be a QuickFIX/J quirk (some might consider it a bug). If ResetOnLogon=N then no ResetSeqNumFlag=Y is sent when the session start time triggers a logon. If ResetOnLogon=Y, the ResetSeqNumFlag=Y is sent on every logon. I believe this is not a big problem in practice because participants in a FIX session typically reset their sequence numbers locally after a session ends (logically ends at the end time, not a connection disconnect).
If you want to slightly modify the source code to implement this behavior, you'd modify the quickfix.Session next() method. You could add a local flag that indicates a session has restarted (per the schedule as determined by checkSessionTime()). Pass that flag to generateLogon() and that method would use it to determine when to send ResetSeqNumFlag=Y regardless of the ResetOnLogon configuration.
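The decision described above can be sketched as follows. This is a Python illustration of the logic only, not QuickFIX/J source: the function name and the session_restarted flag are hypothetical, and the message is reduced to a plain dictionary of FIX tags (35 is MsgType, 141 is ResetSeqNumFlag).

```python
def build_logon_fields(reset_on_logon, session_restarted):
    """Return the Logon fields relevant to sequence resets.

    `reset_on_logon` mirrors the ResetOnLogon config setting.
    `session_restarted` is a hypothetical local flag that would be set
    when the schedule check decides a new session has begun.
    """
    fields = {"35": "A"}  # MsgType = Logon
    if reset_on_logon or session_restarted:
        fields["141"] = "Y"  # ResetSeqNumFlag
    return fields


# ResetOnLogon=N, ordinary reconnect within a session: no reset flag.
print(build_logon_fields(False, False))
# ResetOnLogon=N, but the session just restarted at StartTime: flag is sent.
print(build_logon_fields(False, True))
```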
I don't want to reset the sequence on every log-on.
Then don't do it! Set ResetOnLogon=N.
At StartTime, the session will reset sequence numbers always. If ResetOnLogon=N, then they won't reset again until the next StartTime.
The initiator and acceptor should always have matching ResetOnXXX settings.
What you are asking cannot and should not be done. You start your engine with some config and then change the config while it is running. If something goes wrong, it will be very difficult to pinpoint what started the issue.
Instead of sending ResetSeqNumFlag=Y, try adding ResetOnLogon=Y to the acceptor-side config (if you have control over it), or ResetOnLogout=Y / ResetOnDisconnect=Y to your initiator config file. That would be much easier, and changing the config while running is probably not the best solution.
Your logout will happen at EndTime anyway (a disconnect can happen at any time), which should be easier for your application.

gsoap client call blocks when the server is not available

I am looking for a method to detect if the gsoap web service is available.
Unfortunately, when the service is offline the client gsoap calls block for a long time. Setting soap.recv_timeout and soap.send_timeout to zero does not help.
This is a bit late, but I finally found (what I think is) a better answer by skulking through the source code (why they don't document this, I don't know):
Look for "soap.connect_timeout". When I set this to 3, it times out after 3 seconds as expected when the web service is unavailable.
The above recv_timeout and send_timeout didn't work for me in the case of "service unavailable".
I'm pretty sure that setting soap.recv_timeout and soap.send_timeout to 0 means NO TIMEOUT. Try setting these variables to 1 (1 means 1 second).
I came here looking for a solution to the same problem and recognized the erroneous part about setting recv_timeout to 0, but I had set it to 20 and still got no timeout, so I followed the second post and used connect_timeout, which did work as I intended.
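The distinction the answers arrive at (a connect timeout is separate from send/receive timeouts) can be illustrated outside gsoap. The Python sketch below is not gsoap code; it is a plain TCP probe using the standard socket module, and the function name, host and port are placeholders. It only checks that something accepts connections, not that the SOAP service behind it is healthy.

```python
import socket


def service_reachable(host, port, connect_timeout=3.0):
    """Return True if a TCP connection can be opened within the timeout.

    The timeout here applies to connection establishment, which is the
    phase that blocks when the server is down or unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


if __name__ == "__main__":
    # Placeholder endpoint; replace with the real service host and port.
    print(service_reachable("127.0.0.1", 8080, connect_timeout=3.0))
```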