Filebeat: harvester closed after inactivity but doesn't start again - elastic-stack

I'm trying to set up Filebeat (8.3.3). I'm seeing the harvester close after 5 minutes because no changes are made to the file, but it never starts again. I know I can raise the timeout by setting close.on_state_change.inactive higher than the longest inactive period between log messages being written.
I was under the impression that the harvester would be revived if a change to the file occurs? Unfortunately, I don't see that happening.
In the doc for close.on_state_change.inactive it says:
If the closed file changes again, a new harvester is started and the latest changes will be picked up after prospector.scanner.check_interval has elapsed.
Am I misunderstanding the expected behaviour here?
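For reference, the relevant filestream settings would look roughly like this (a minimal sketch; the path, id, and values are only illustrative):
filebeat.inputs:
  - type: filestream
    id: my-app-logs
    paths:
      - /var/log/myapp/*.log
    # close the harvester after 5 minutes without changes (the default)
    close.on_state_change.inactive: 5m
    # how often the scanner looks for new or changed files (default 10s)
    prospector.scanner.check_interval: 10s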

Related

Is there a way to disable a mega-detailed error message in Rundeck's failed executions?

The message that is posted every time an execution fails is too verbose and creates too much noise. Is there a way to disable or hide it somehow? We have our own error messages within the scripts and don't need a red chunk of extra text displayed in the logs. Adding an error handler that will exit with code 0 is not an option because we still need the job to fail if a step fails.
That message is Rundeck's standard output in case of failure. You can manipulate the log level, but that doesn't affect the NonZeroResultCode last line, only the text from your commands/scripts. You can suggest that change upstream; the rest are just workarounds, as you mentioned.
This occurs even when using the Quiet output.

MFT pipeline unexpected exit

I have an MFT pipeline as below:
FaceDraw and FaceDetect are custom MFTs. The pipeline runs fine for the first few frames and then all of a sudden it exits.
I tried analyzing it with MFTrace and narrowed it down to a spurious event, shown below:
CMFPresentationClockDetours::GetTime #0000004BF1114E50 Time 20687020hns
This is not expected and suddenly pops up. It further leads to the set of events below, which cause the pipeline to drain:
CMFClockStateSinkDetours::OnClockStop #0000004BF98AFE88 System time 668039142ms
CMFClockStateSinkDetours::OnClockStop #0000004BF98AFB28 System time 668039142ms
CMFTransformDetours::ProcessMessage #0000004BF1838970 Message type=0x00000000 MFT_MESSAGE_COMMAND_FLUSH, param=00000000
So the question is: what are the ways to figure out the initiator of this event? Are there any better tools for debugging issues like this?
Initially I suspected the EVR, but removing the EVR from the pipeline makes no difference.

QuickFIX resets sequence numbers at start time but does not set ResetSeqNumFlag in the Logon message

When the QuickFIX initiator reconnects at StartTime (defined in the config), it deletes the sequence number files but does not set ResetSeqNumFlag to Y, and the server replies with a Logout message with the text "seq msg number to low ...".
Is there a way to set ResetSeqNumFlag=Y only for this case? I don't want to reset the sequence on every logon.
This appears to be a QuickFIX/J quirk (some might consider it a bug). If ResetOnLogon=N, no ResetSeqNumFlag=Y is sent when the session start time triggers a logon. If ResetOnLogon=Y, ResetSeqNumFlag=Y is sent on every logon. I believe this is not a big problem in practice, because participants in a FIX session typically reset their sequence numbers locally after a session ends (a session logically ends at its end time, not at a connection disconnect).
If you want to slightly modify the source code to implement this behavior, you'd modify the quickfix.Session next() method. You could add a local flag that indicates a session has restarted (per the schedule as determined by checkSessionTime()). Pass that flag to generateLogon() and that method would use it to determine when to send ResetSeqNumFlag=Y regardless of the ResetOnLogon configuration.
I don't want to reset the sequence on every logon.
Then don't do it! Set ResetOnLogon=N.
At StartTime, the session will always reset sequence numbers. If ResetOnLogon=N, they won't reset again until the next StartTime.
The initiator and acceptor should always have matching ResetOnXXX settings.
What you are asking cannot, and should not, be done. You would start your engine with one config and then change the config while it is running; if something goes wrong, it will be very difficult to pinpoint what started the issue.
Instead of setting ResetSeqNumFlag=Y, try adding ResetOnLogon=Y to the acceptor-side config (that is, if you have control over it), or ResetOnLogout=Y / ResetOnDisconnect=Y to your initiator config file. That would be much easier; changing config while running is possibly not the best solution.
Your logout will happen at EndTime anyway (a disconnect can happen at any time), and that should be easier for your application to handle.
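For illustration, a minimal initiator session config along those lines might look like this (a sketch only; the FIX version, CompIDs, host, and times are placeholders):
[DEFAULT]
ConnectionType=initiator
ReconnectInterval=30
HeartBtInt=30

[SESSION]
BeginString=FIX.4.4
SenderCompID=MYCLIENT
TargetCompID=BROKER
SocketConnectHost=broker.example.com
SocketConnectPort=9876
StartTime=08:00:00
EndTime=17:00:00
# reset locally when the session ends rather than on every logon
ResetOnLogon=N
ResetOnLogout=Y
ResetOnDisconnect=Y
As the earlier answer notes, the acceptor side would need matching ResetOnXXX settings for this to work.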

How does Solaris SMF determine if something is to be in maintenance or to be restarted?

I have a daemon process that I wrote, executed by SMF. The problem is that when an error occurs, I have a failure code and the process then needs to restart from scratch. Right now it calls sys.exit(0) (Python), but SMF keeps throwing it into maintenance mode.
I've worked with SMF enough to know that it sometimes auto-restarts certain services (and lets others fail and have you deal with them like this). How do I classify this process as one that needs to auto-restart? Is it an SMF setting, a method of failing, what?
Manpage
Solaris uses a combination of startd/critical_failure_count and startd/critical_failure_period as described in the svc.startd manpage:
startd/critical_failure_count
startd/critical_failure_period
The critical_failure_count and critical_failure_period properties together specify the maximum number of service failures allowed in a given time interval before svc.startd transitions the service to maintenance. If the number of failures exceeds critical_failure_count in any period of critical_failure_period seconds, svc.startd will transition the service to maintenance.
Defaults in the source code
The defaults can be found in the source; the value depends on whether the service is "wait style":
if (instance_is_wait_style(inst))
        critical_failure_period = RINST_WT_SVC_FAILURE_RATE_NS;
else
        critical_failure_period = RINST_FAILURE_RATE_NS;
The defaults are either 5 failures/10 minutes or 5 failures/second:
#define RINST_START_TIMES 5 /* failures to consider */
#define RINST_FAILURE_RATE_NS 600000000000LL /* 1 failure/10 minutes */
#define RINST_WT_SVC_FAILURE_RATE_NS NANOSEC /* 1 failure/second */
These values can be set as properties in the SMF manifest:
<service_bundle type="manifest" name="npm2es">
  <service name="site/npm2es" type="service" version="1">
    ...
    <property_group name="startd" type="framework">
      <propval name="critical_failure_count" type="integer" value="10"/>
      <propval name="critical_failure_period" type="integer" value="30"/>
      <propval name="ignore_error" type="astring" value="core,signal"/>
    </property_group>
    ...
  </service>
</service_bundle>
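If the service is already imported, the same properties can presumably also be set from the command line with svccfg and picked up after a refresh (the FMRI and values here are only illustrative):
# create the startd property group if it doesn't exist yet
svccfg -s svc:/site/npm2es:default addpg startd framework
# allow up to 10 failures within 30 seconds before maintenance
svccfg -s svc:/site/npm2es:default setprop startd/critical_failure_count = integer: 10
svccfg -s svc:/site/npm2es:default setprop startd/critical_failure_period = integer: 30
# make the running instance pick up the new values
svcadm refresh svc:/site/npm2es:default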
TL;DR
After checking against the startd values: if the service is "wait style", it will be throttled to a maximum restart rate of 1/sec until it no longer exits with a non-cfg error. If the service is not "wait style", it will be put into maintenance mode.
Presuming a normal service manifest, I would suspect that you're dropping into maintenance because SMF is restarting you "too quickly" (which is a bit arbitrarily defined). svcs -xv should tell you if that is the case. If it is, SMF is restarting you, you're exiting again rapidly, and it has decided to give up until the problem is fixed (and you've manually run svcadm clear).
I'd wondered if exiting 0 (and indicating success) may cause further confusion, but it doesn't appear that it will.
I don't think Oracle Solaris allows you to tune what SMF considers "too quickly".
You have to create a service manifest, which is more complicated than it sounds. The document below has example manifests and describes the manifest structure:
http://www.oracle.com/technetwork/server-storage/solaris/solaris-smf-manifest-wp-167902.pdf
As it turns out, I had two pkills in a row to make sure everything was terminated correctly. The second one, naturally, was exiting with something other than 0. Changing this to include an exit 0 at the end of the script solved the problem.

UIAutomation timeouts usage

Help me understand how the timeouts are used. The documentation only gives a couple of words about them:
popTimeout - Retrieves the previous timeout value from a stack, restores it as the current timeout value, and returns it.
pushTimeout - Stores the current timeout value on a stack and sets a new timeout value.
They also provide some code:
target = UIATarget.localTarget();
target.pushTimeout(2);
// attempt element access
target.popTimeout();
But I don't exactly understand how and when to use them. Can anybody give an example?
During automation testing, some elements might not become visible right away, so Instruments uses a timeout (default 5 seconds) to wait for requested elements. They call this the grace period.
Sometimes the default grace period might not be what you need, so you can change the timeout to a shorter or longer value.
Using pushTimeout and popTimeout makes sure the previous grace period is restored after calling popTimeout, without the need to remember it yourself.
For example: in one of my tests, I don't want to wait for a popover to become active, but I just want to know if there is a popover active, and dismiss it if there is:
var target = UIATarget.localTarget();
var app = target.frontMostApp();
var popOver;

target.pushTimeout(0.0);   // don't wait for the popover at all
if ( target.isDeviceiPad() && ! isNull( popOver = app.mainWindow().popover() ) )
{
    UIALogger.logDebug(" dismiss popup by tapping somewhere");
    popOver.dismiss();
    target.delay(0.2);
}
target.popTimeout();       // restore the previous grace period
BTW, the isNull() is a custom function I made, but you probably understand what is going on.
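Another illustrative use (just a sketch; the button name and timeout value are hypothetical) is to temporarily lengthen the grace period while waiting for a slow screen:
var target = UIATarget.localTarget();

// give a slow screen up to 10 seconds to show its "Login" button
target.pushTimeout(10);
target.frontMostApp().mainWindow().buttons()["Login"].tap();
target.popTimeout();   // back to whatever the grace period was before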