Nagios retry interval when they have OK or UP state - CentOS

I have configured a Linux host on my Nagios monitoring server using the NRPE plugin.
For this I followed the URL below:
http://www.tecmint.com/how-to-add-linux-host-to-nagios-monitoring-server/
I need to check some services on that Linux host.
To monitor the host and its services I am watching the Nagios log (/usr/local/nagios/var/nagios.log).
At first everything is fine, and the log shows the following status:
SERVICE ALERT: test.testing.local;Service Tomcat;OK;SOFT;6;TOMCAT OK
When the service changes to a non-OK state, the log shows:
SERVICE ALERT: test.testing.local;Service Tomcat;CRITICAL;SOFT;4;TOMCAT CRITICAL
But I also want that, when my service status does not change to a non-OK state (i.e. it stays OK), the following line is logged again after 1 minute:
SERVICE ALERT: test.testing.local;Service Tomcat;OK;SOFT;6;TOMCAT OK
Currently that is not happening.
My services.cfg contains:
define service {
host_name test.testing.local
service_description Service Tomcat
check_command check_nrpe!check_service_tomcat
max_check_attempts 10
check_interval 1
retry_interval 1
active_checks_enabled 1
check_period 24x7
register 1
}
I am using Nagios 4.2.2 and CentOS 7.

I think what you are after is explained in the Nagios 4 Core docs:
check_interval: This directive is used to define the number of "time units" between regularly scheduled checks of the host. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.
retry_interval: This directive is used to define the number of "time
units" to wait before scheduling a re-check of the hosts. Hosts are
rescheduled at the retry interval when they have changed to a non-UP
state. Once the host has been retried max_check_attempts times without
a change in its status, it will revert to being scheduled at its
"normal" rate as defined by the check_interval value. Unless you've
changed the interval_length directive from the default value of 60,
this number will mean minutes. More information on this value can be
found in the check scheduling documentation.
If you set check_interval to 1 (one minute with the default interval_length of 60 seconds, which is pretty frequent), then after a change to a non-OK state the service is rechecked every minute, up to 10 times (your max_check_attempts), as long as the status does not change; after that it reverts to being scheduled at its "normal" rate as defined by check_interval, and the next OK/UP result will show up on that schedule.
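To make that interplay concrete, here is the service definition from the question again, with comments (mine) spelling out what each timing directive does under the default interval_length of 60; the values themselves are unchanged:
define service {
host_name test.testing.local
service_description Service Tomcat
check_command check_nrpe!check_service_tomcat
max_check_attempts 10 ; retries before a non-OK state is considered HARD
check_interval 1 ; normal schedule: one check per minute while the state is steady
retry_interval 1 ; used only after a change to a non-OK state, until max_check_attempts is reached
active_checks_enabled 1
check_period 24x7
register 1
}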

Related

No error when stopping a non-existing service with Chef

I'm new to Chef and trying to understand why this code does not return any error, while if I do the same with 'start' I get an error because the service does not exist.
service 'non-existing-service' do
action :stop
end
# chef-apply test.rb
Recipe: (chef-apply cookbook)::(chef-apply recipe)
* service[non-existing-service] action stop (up to date)
I don't know which platform you are running on, but if you are running on Windows it should at least log
Chef::Log.debug "#{@new_resource} does not exist - nothing to do"
given that your log level is set to debug.
You could argue this is the wrong behaviour, but if the service doesn't exist it certainly isn't running.
Source code
https://github.com/chef/chef/blob/master/lib/chef/provider/service/windows.rb#L147
If you are getting one of the variants of the init.d provider, they default to getting the current status of a service by grepping the process table. Because Chef does its own idempotence checks internally before calling the provider's stop method, it would see there is no such process in the table and assume it was already stopped.
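If you want the run to fail loudly when the service does not exist, one option is to assert its existence yourself before the service resource runs. A minimal sketch, assuming a systemd host (the resource name and unit name are illustrative):
# Fails the Chef run if systemd does not know the unit at all.
execute 'assert-service-exists' do
  command 'systemctl cat non-existing-service'
end

service 'non-existing-service' do
  action :stop
end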

Renews threshold and Renews (last min)

I am testing a spring-cloud eureka server and client.
I have a simple question about the default configuration (server & client).
On the server side, the renew threshold is equal to 3.
On the client side, it sends a heartbeat every 30 seconds (a maximum of 2 per minute).
When I look at the registry dashboard, once the waitTimeInMsWhenSyncEmpty period has elapsed, I see the following warning message:
EMERGENCY! EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN THEY'RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE
When I look at the code, the test getNumOfRenewsInLastMin() <= numberOfRenewsPerMinThreshold is always true (2 <= 3)
Why is this the default configuration? It seems weird because it constantly generates a warning!
Can anyone give me an explanation? I think I've missed something…
I have the same problem and investigated it a little bit. The root cause of the warning message is that the renews are exactly 1 lower than the threshold.
It occurs when you start a plain Eureka server and do not register any clients: the Renews threshold starts out at 1.
When you then register a client, the client's 2 expected renews per minute are simply added to the 1 that is already there, so the Renews threshold becomes 1 + 2 = 3, which is more than the 2 renews per minute a single client can ever deliver. Wait a few minutes (about 4) and the warning will appear.
My application.yml is:
spring:
  application:
    name: service-registry
server:
  port: 8761
eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
I'm using Brixton.RC1.
I found two other SO questions on the same topic:
Understanding Spring Cloud Eureka Server self preservation and renew threshold
Spring Eureka server shows RENEWALS ARE LESSER THAN THE THRESHOLD
Here are some details:
You can find the check that displays the message in the file below:
https://github.com/spring-cloud/spring-cloud-netflix/blob/master/spring-cloud-netflix-eureka-server/src/main/resources/templates/eureka/navbar.ftl
The value of "isBelowRenewThresold" comes from the code below:
model.put("isBelowRenewThresold", registry.isBelowRenewThresold() == 1);
The invoked method can be found in the following file:
https://github.com/Netflix/eureka/blob/master/eureka-core/src/main/java/com/netflix/eureka/registry/PeerAwareInstanceRegistryImpl.java
Thank you for your help.
Regards,
Stephane
I had the same problem; I tried this in application.properties:
eureka.client.lease.duration=10
eureka.instance.leaseRenewalIntervalInSeconds=5
eureka.instance.leaseExpirationDurationInSeconds=2
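For a throwaway single-instance setup, another workaround you will often see (my addition, not from the answers above, and not recommended for production) is to switch off the server's self-preservation mode so it stops warning and starts expiring instances again. In the application.yml from the question that would look roughly like:
eureka:
  server:
    # disables self-preservation; only sensible for local testing
    enable-self-preservation: false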

How to cancel or remove Persistent EJBTimers

When we use a persistent EJB timer with @Schedule and persistent=true, deploy it to the cluster, and then change the actual schedule inside @Schedule and re-deploy, does the original schedule get replaced with the new one (removed and added with the new parameters), or do both schedules remain active (keeping in mind that persistent=true is set)?
This is what I have read so far: each scheduler instance has a unique JNDI name, and @Schedule automatically creates a timer at application deployment, so it would be better to remove the automatically created EJB timer or cancel the original schedule to avoid trouble. But I don't know how to cancel the original schedule programmatically, or whether that would need to be done by the WebSphere admins if both the original and the changed schedule remain active.
Also, from this document the removeAutomaticEJBTimers command is used to remove timers from a specified scheduler, but that also seems to be in the territory of a WebSphere admin, not a developer.
How can a developer programmatically cancel an automatic EJB timer created with the @Schedule annotation?
I am using Java EE 6 with WebSphere 8.5 and EJB 3.1.
Do the following to remove persisted EJB timers:
Delete the directory jboss-home\standalone\data\timer-service-data\{yourprojectname}.{servicename}
See this page: Creating timers using the EJB timer service
The application server automatically removes persistent automatic
timers from the database when you uninstall the application while the
server is running. If the application server is not running, you must
manually delete the automatic timers from the database. Additionally,
if you add, remove, or change the metadata for automatic timers while
the server is not running, you must manually delete the automatic
timers.
I have the following class:
@Stateless
@LocalBean
public class HelloBean {
    @Schedule(persistent=true, minute="*", hour="*", info="myTimer")
    public void printHello() {
        System.out.println("### hello");
    }
}
When I install it on the server, I can find the related automatic timer:
C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\bin>findEJBTimers.bat server1 -all
ADMU0116I: Tool information is being logged in file C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\logs\server1\EJBTimers.log
ADMU0128I: Starting tool with the AppSrv02 profile
ADMU3100I: Reading configuration for server: server1
EJB timer : 3 Expiration: 2/14/15 12:39 PM Calendar
EJB : ScheduleTestEAR, ScheduleTest.jar, HelloBean
Info : myTimer
Automatic timer with timout method: printHello
Calendar expression: [start=null, end=null, timezone=null, seconds="0",
minutes="*", hours="*", dayOfMonth="*", month="*", dayOfWeek="*", year="*"]
1 EJB timer tasks found
After uninstalling the application, the timer is removed:
C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\bin>findEJBTimers.bat server1 -all
ADMU0116I: Tool information is being logged in file
C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\logs\server1\EJBTimers.log
ADMU0128I: Starting tool with the AppSrv02 profile
ADMU3100I: Reading configuration for server: server1
0 EJB timer tasks found
I don't know how you are 'redeploying' your applications, but it looks like your process is incorrect, since with the normal install/uninstall/update process automatic timers are removed correctly.
UPDATE
The same page also has information regarding the ND (Network Deployment) environment:
Automatic persistent timers are removed from their persistent store
when their containing module or application is uninstalled. Therefore,
do not update applications that use automatic persistent timers with
the Rollout Update feature. Doing so uninstalls and reinstalls the
application while the cluster is still operational, which might cause
failure in the following cases:
If a timer running in another cluster member activates after the database entry is removed and before the database entry is recreated,
then the timer fails. In this case, a
com.ibm.websphere.scheduler.TaskPending exception is written to the
First Failure Data Capture (FFDC), along with the SCHD0057W message,
indicating that the task information in the database has been changed
or canceled.
If the timer activates on a cluster member that has not been updated after the timer data in the database has been updated, then
the timer might fail or cause other failures if the new timer
information is not compatible with the old application code still
running in the cluster member.
In JBoss/WildFly, if you change the timer-service to use a "clustered-store" instead of "default-file-store", you'll be able to programmatically cancel a Timer. Here is a brief guide explaining how to do it:
Mastertheboss.com: Creating clustered EJB 3 Timers
Published: 08 March 2015
http://www.mastertheboss.com/jboss-server/wildfly-8/creating-clustered-ejb-3-timers
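For completeness, here is a minimal sketch of the programmatic side using the standard EJB 3.1 TimerService API; it reuses the HelloBean example from above and assumes the info string "myTimer" as the tag to match. Whether the cancellation sticks across redeployments is subject to the persistence behaviour described in the answers above.
import javax.annotation.Resource;
import javax.ejb.LocalBean;
import javax.ejb.Schedule;
import javax.ejb.Stateless;
import javax.ejb.Timer;
import javax.ejb.TimerService;

@Stateless
@LocalBean
public class HelloBean {

    // getTimers() only returns timers owned by this bean.
    @Resource
    private TimerService timerService;

    @Schedule(persistent = true, minute = "*", hour = "*", info = "myTimer")
    public void printHello() {
        System.out.println("### hello");
    }

    // Cancels the automatic timer(s) whose info string matches the given tag.
    public void cancelMyTimer(String infoTag) {
        for (Timer timer : timerService.getTimers()) {
            if (infoTag.equals(timer.getInfo())) {
                timer.cancel();
            }
        }
    }
}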

Nagios not sending emails

I want to set up Nagios to send email notifications.
I can send email notifications manually by clicking "Send custom service notification" in the Nagios web interface. The notification is created and the email is sent and delivered successfully.
But Nagios doesn't send notifications automatically. I have tested this by making the machine ignore pings (echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all). Nagios sets the PING service to the CRITICAL state, but doesn't send a notification email.
These are my config files:
Part of templates.cfg
define contact{
name generic-contact ; The name of this contact template
service_notification_period 24x7 ; service notifications can be sent anytime
host_notification_period 24x7 ; host notifications can be sent anytime
service_notification_options w,u,c,r,f,s ; send notifications for all service states, flapping events, and scheduled downtime events
host_notification_options d,u,r,f,s ; send notifications for all host states, flapping events, and scheduled downtime events
service_notification_commands notify-service-by-email ; send service notifications via email
host_notification_commands notify-host-by-email ; send host notifications via email
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
}
Part of contacts.cfg
define contact{
contact_name nagiosadmin ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user
service_notification_options w,u,c,r,f,s
host_notification_options d,u,r,f,s
email MY-EMAIL@gmail.com ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagiosadmin
}
generic-host_nagios2.cfg
define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
check_command check-host-alive
max_check_attempts 10
notification_interval 1
notification_period 24x7
notification_options d,u,r,f,s
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
generic-service_nagios2.cfg
define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_interval 1 ; Only send notifications on status change by default.
is_volatile 0
check_period 24x7
normal_check_interval 5
retry_check_interval 1
max_check_attempts 4
notification_period 24x7
notification_options w,u,c,r,f,s
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
How can I force nagios to send notification emails?
I had a similar issue. It turned out to be a combination of two problems:
1) I was not waiting long enough for the alerts. Add up your normal_check_interval and retry_check_interval*max_check_attempts for services and you'll see that you may have to wait as long as 9 minutes before getting a notification (here: 5 + 1*4 = 9). Decrease normal_check_interval and max_check_attempts if you need to know about failures of a service faster; a sketch follows after this answer. Note that with the default Nagios configuration it can be as much as 15 minutes before it notifies you.
2) The default configuration for linux-server is to notify you only during work hours, but the server in question was on UTC. Make sure the notification_period variable is 24x7 everywhere.
Good luck.
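As a sketch of point 1, you can temporarily override the timing for a single service so that a failure is noticed within a few minutes while you test; the host name and check command here are illustrative, the directives are the same ones used in the templates above:
define service{
use generic-service
host_name your-host
service_description PING
check_command check_ping!100.0,20%!500.0,60%
normal_check_interval 1 ; check every minute while the state is steady
retry_check_interval 1 ; retry every minute after a failure
max_check_attempts 2 ; the state goes HARD (and notifies) on the 2nd failed check
}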
Check the main nagios.cfg file and make sure notifications_enabled=1. Also verify that your base contact template has host_notifications_enabled and service_notifications_enabled set to 1.
Please also consider the flap_detection_enabled setting. If it is enabled and Nagios determines that the service is flapping, it will not notify. Flapping means a service is changing state too frequently, for example bouncing between OK and CRITICAL. During testing it is common for a service to "flap" as you repeatedly break and fix it to make sure everything works. Disable flap_detection_enabled in both your host and service config files while testing.
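The quickest way to rule flapping out while you test notifications is to flip the existing template setting off and restart Nagios; remember to turn it back on afterwards:
# in generic-service_nagios2.cfg and generic-host_nagios2.cfg
flap_detection_enabled 0 ; temporarily disabled while testing notifications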

Can I open a clustered MQ queue for writing in Perl?

If I have a WebSphere MQ queue defined on another queue manager in the cluster, is there a way I can open it for writing using the Perl interface? The code below comes back with MQRC 2085.
$messageQ = MQSeries::Queue->new(
    QueueManager => $qMgr,
    Queue        => $queue,
    Options      => $openOpt
) or die ">>>ERROR2: Unable to open the queue: $queue\n";
Yes! The Perl modules are a thin veneer over the WMQ API and expose all the basic options and most of the really esoteric stuff as well.
When you open a queue, WebSphere MQ performs name resolution on the values you provide for Queue and QMgr names. If you provide both a Queue and a QMgr name then the object reference is fully qualified and WMQ will attempt to open it as named. So if the name you provide is the local QMgr and the clustered queue does not have a locally defined instance, the open will fail with a 2085 Unknown Object Name.
The trick to opening a clustered queue is to provide a null value for the QMgr name. This causes name resolution to check the local QMgr for a queue of the same name; then, finding none, it checks the cluster repository and resolves the open to the clustered queue. Note that the queue must be advertised to the cluster for this to work. Specifically, the CLUSTER or CLUSNL attribute of the target queue must be non-blank and refer to a cluster that the source QMgr participates in. Similarly, the destination QMgr must also participate in the same cluster as the source QMgr.
Note also that if you specify a QMgr name on the open that is not the local QMgr, then WMQ will attempt to resolve the QMgr name only. If it can resolve a route to that QMgr then it will send the message there. This means that in a cluster you can send a message to any queue on any QMgr so long as you know the fully-qualified name.
Finally, you can define a local alias over a clustered queue. For example, if on QMGRA you DEF QA(TARGET.QUEUE) TARGQ(TARGET.QUEUE), and then on QMGRB and QMGRC in the same cluster you DEF QL(TARGET.QUEUE) CLUSTER(MYCLUS), then it is possible to open QMGR=QMGRA QUEUE=TARGET.QUEUE and still have it work as expected. Note that the alias is NOT advertised to the cluster but the target queue is.
The only issue with this approach is that the first time the alias is opened the API call may fail if the cluster query takes too long. When I do this in Production, I always use amqsput on the alias ahead of time to make the QMgr query the repository before the actual application opens the queue.
Why would you do this? If security is a concern, you probably don't want to authorize all apps directly to the cluster XMitQ because, as noted above, they could then put a message onto any queue on any QMgr in the cluster, including SYSTEM.ADMIN.COMMAND.QUEUE. The alias gives you a place to hang authorizations and restrict the user to specific destinations in the cluster.
So short answer, make sure you provide a null QMgr name on the Open call or set up a local alias over the clustered queue. For more about the security aspects of this, see the WMQ Security presentation at http://t-rob.net/links
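To tie that back to the Perl code in the question, here is a rough sketch of an open-for-output that relies on cluster name resolution. It is based on my reading of the MQSeries CPAN module's documentation, so treat the parameter names (QueueManager, Queue, Mode) and the queue manager/queue names as illustrative and check the module's POD before relying on it:
use strict;
use warnings;
use MQSeries::QueueManager;
use MQSeries::Queue;
use MQSeries::Message;

# Connect to the LOCAL queue manager that participates in the cluster.
my $qmgr = MQSeries::QueueManager->new(QueueManager => 'LOCAL.QMGR')
    or die "Unable to connect to queue manager\n";

# Open the queue by name only -- no ObjectQMgrName -- so that name
# resolution can fall through to the cluster repository.
my $queue = MQSeries::Queue->new(
    QueueManager => $qmgr,
    Queue        => 'CLUSTERED.QUEUE',
    Mode         => 'output',
) or die "Unable to open CLUSTERED.QUEUE\n";

my $message = MQSeries::Message->new(Data => 'hello from the cluster');
$queue->Put(Message => $message)
    or die "Put failed, reason " . $queue->Reason() . "\n";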