Nagios event handler ignoring check interval - event-handling

I have recently created an event handler for a service check which will restart Tomcat on 3 different boxes.
The check settings are:
5 checks
2 Minute checks when Ok
5 Minute checks otherwise
In the event handler script I have:
# What state is the iOS PN in?
case "$1" in
OK)
    # The service is ok, so don't do anything...
    ;;
WARNING)
    # Is this a "soft" or a "hard" state?
    case "$2" in
    SOFT)
        case "$3" in
        # Check number
        2)
            echo "$(date) Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)..." >> /tmp/iOSPN.log
            ;;
        3)
            echo "$(date) Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)..." >> /tmp/iOSPN.log
            ;;
        4)
            echo "$(date) Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)..." >> /tmp/iOSPN.log
            ;;
        esac
        ;;
    HARD)
        # Do nothing, let Nagios send the alert
        ;;
    esac
    ;;
CRITICAL)
    # In theory nothing should reach this point...
    ;;
esac
exit 0
So the event handler should restart Tomcat on node 1 after the 2nd warning check and wait 5 minutes before checking again. If the problem persists, it should restart node 2, wait another 5 minutes, check again, and then restart node 3 if the problem is still there.
However when I check the log file I can see the following:
Thu Apr 18 15:09:13 2019 Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)...
Thu Apr 18 15:09:23 2019 Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)...
Thu Apr 18 15:09:33 2019 Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)...
As you can see, it would have restarted each box 10 seconds apart rather than 5 minutes apart. I have removed the lines which actually restart Tomcat, as a restart cannot complete in such a short time.
I cannot see anything in the Nagios logs detailing why it did the next check so quickly after, so any help would be appreciated.
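The spacing between handler runs can be checked directly from the log timestamps. The following is a minimal Python sketch, assuming the `date` format seen in /tmp/iOSPN.log above; the helper name `gaps_in_seconds` is illustrative and not part of Nagios.

```python
from datetime import datetime

# The three lines copied from /tmp/iOSPN.log in the question.
LOG_LINES = [
    "Thu Apr 18 15:09:13 2019 Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)...",
    "Thu Apr 18 15:09:23 2019 Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)...",
    "Thu Apr 18 15:09:33 2019 Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)...",
]

STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"
STAMP_LENGTH = len("Thu Apr 18 15:09:13 2019")


def gaps_in_seconds(lines):
    """Return the time between consecutive log entries, in seconds."""
    stamps = [datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT) for line in lines]
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# retry_check_interval is 5 (minutes) in the service template.
EXPECTED_GAP = 5 * 60

for gap in gaps_in_seconds(LOG_LINES):
    print(f"observed {gap:.0f}s between handler runs, expected about {EXPECTED_GAP}s")
```

Run against a longer log, the same helper would show whether the 10-second spacing is consistent or only happens occasionally.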
Additional:
This is the service definition:
define service{
use 5check-service
host_name ACTIVEMQ1
contact_groups tyrell-admins-non-critical
service_description ActiveMQ - iOS PushNotification Queue Pending Items
event_handler restartRemote_Tomcat!$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
check_command check_activemq_queue_item2!http://activemq1:8161/admin/xml/queues.jsp!IosPushNotificationQueue!100!300
}
define service{
name 5check-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 5 ; Re-check the service up to 5 times in order to determine its final (hard) state
normal_check_interval 2 ; Check the service every 2 minutes under normal conditions
retry_check_interval 5 ; Re-check the service every 5 minutes until a hard state can be determined
contact_groups support ; Notifications get sent out to everyone in the 'support' group
notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
notification_interval 5 ; Re-notify about service problems every 5 mins
notification_period 24x7 ; Notifications can be sent out at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

fail2ban - how to ban an IP permanently after it was banned 3 times temporarily

I have set up the fail2ban service on CentOS 8 by following this tutorial: https://www.cyberciti.biz/faq/how-to-protect-ssh-with-fail2ban-on-centos-8/.
I configured it much as the tutorial does:
[DEFAULT]
# Ban IP/hosts for 24 hour ( 24h*3600s = 86400s):
bantime = 86400
# An ip address/host is banned if it has generated "maxretry" during the last "findtime" seconds.
findtime = 1200
maxretry = 3
# "ignoreip" can be a list of IP addresses, CIDR masks or DNS hosts. Fail2ban
# will not ban a host which matches an address in this list. Several addresses
# can be defined using space (and/or comma) separator. For example, add your
# static IP address that you always use for login such as 103.1.2.3
#ignoreip = 127.0.0.1/8 ::1 103.1.2.3
# Call iptables to ban IP address
banaction = iptables-multiport
# Enable sshd protection
[sshd]
enabled = true
I would like an IP to be banned permanently after it has been banned temporarily 3 times. How can I do that?
Persistent banning is not advisable: it needlessly loads your netfilter subsystem (as well as fail2ban). A long ban is enough.
If you use v0.11, you can use the bantime increment feature. Your config may look like the one in this answer: https://github.com/fail2ban/fail2ban/discussions/2952#discussioncomment-414693
[sshd]
# initial ban time:
bantime = 1h
# incremental banning:
bantime.increment = true
# default factor (causes increment - 1h -> 1d 2d 4d 8d 16d 32d ...):
bantime.factor = 24
# max banning time = 5 week:
bantime.maxtime = 5w
But note that if this feature is enabled, it also affects maxretry, so the 2nd and subsequent bans of IPs already known as bad occur much earlier than after 3 attempts (the threshold is halved each time).
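To make the progression in the config comment concrete, here is a small Python sketch. It only reproduces the sequence stated in that comment (1h, then 1d, 2d, 4d, ... capped at bantime.maxtime); the doubling rule and the function name `ban_durations` are assumptions made for illustration and are not taken from fail2ban's source.

```python
def ban_durations(initial_hours, factor, max_hours, bans):
    """Ban length in hours for each successive ban of the same IP.

    Assumed model (matches the config comment, not verified against
    fail2ban internals): the first ban lasts `initial_hours`; each
    repeat ban lasts initial * factor * 2**(n - 1), capped at `max_hours`.
    """
    durations = []
    for n in range(bans):
        if n == 0:
            hours = initial_hours
        else:
            hours = initial_hours * factor * 2 ** (n - 1)
        durations.append(min(hours, max_hours))
    return durations


# bantime = 1h, bantime.factor = 24, bantime.maxtime = 5w (5 * 7 * 24 = 840 hours)
for count, hours in enumerate(ban_durations(1, 24, 5 * 7 * 24, 8), start=1):
    print(f"ban {count}: {hours} h ({hours / 24:g} d)")
```

Under this model the cap of 5 weeks is reached on the eighth ban, which for most purposes behaves like a permanent ban for a repeat offender.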
You can use jail [recidive] with bantime = -1 for permanent ban. Example jail.local:
# Jail for more extended banning of persistent abusers
# !!! WARNINGS !!!
# 1. Make sure that your loglevel specified in fail2ban.conf/.local
# is not at DEBUG level -- which might then cause fail2ban to fall into
# an infinite loop constantly feeding itself with non-informative lines
# 2. Increase dbpurgeage defined in fail2ban.conf to e.g. 648000 (7.5 days)
# to maintain entries for failed logins for sufficient amount of time
[recidive]
enabled = true
logpath = /var/log/fail2ban.log
banaction = %(banaction_allports)s
bantime = -1 ; permanent
findtime = 86400 ; 1 day
maxretry = 6
General note:
Use SSH key auth and set "AllowGroups" or "AllowUsers" in sshd_config. Most SSH login attempts will stop after a few tries. On my servers I also see the number of attempts drop off over months or years.

Kubernetes 1.15.5 and romana 2.0.2 getting network errors when ANY pods added or removed

I have encountered some mysterious network errors in our kubernetes cluster. Although I originally encountered these errors using ingress, there are even more errors when I bypass our load balancer, bypass kube-proxy and bypass nginx-ingress. The most errors occur when going directly to services and straight to the pod IPs. I believe this is because the load balancer and nginx have better error handling than the raw iptables routing.
To test the error I use apache benchmark from a VM on the same subnet, at any concurrency level, with no keep-alive, connecting to the pod IP and using a request number high enough to give me time to either scale up or scale down a deployment. The odd thing is that it doesn't matter at all which deployment I modify: it always causes the same sets of errors, even when it is not related to the pod I am testing. ANY additions or removals of pods will trigger apache benchmark errors. Manual deletions, scaling up/down and auto-scaling all trigger errors. If there are no pod changes while the ab test is running then no errors get reported. Note that keep-alive does seem to greatly reduce if not eliminate the errors, but I only tested that a handful of times and never saw an error.
Other than some bizarre iptables conflict, I really don't see how deleting pod A can affect network connections of pod B. Since the errors are brief and go away within seconds it seems more like a brief network outage.
Sample ab test: ab -n 5000 -c 2 https://10.112.0.24/
Errors when using HTTPS:
SSL handshake failed (5).
SSL read failed (5) - closing connection
Errors when using HTTP:
apr_socket_recv: Connection reset by peer (104)
apr_socket_recv: Connection refused (111)
Example ab output. I ctl-C after encountering first errors:
$ ab -n 5000 -c 2 https://10.112.0.24/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.112.0.24 (be patient)
Completed 500 requests
Completed 1000 requests
SSL read failed (5) - closing connection
Completed 1500 requests
^C
Server Software: nginx
Server Hostname: 10.112.0.24
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 2575 bytes
Concurrency Level: 2
Time taken for tests: 21.670 seconds
Complete requests: 1824
Failed requests: 2
(Connect: 0, Receive: 0, Length: 1, Exceptions: 1)
Total transferred: 5142683 bytes
HTML transferred: 4694225 bytes
Requests per second: 84.17 [#/sec] (mean)
Time per request: 23.761 [ms] (mean)
Time per request: 11.881 [ms] (mean, across all concurrent requests)
Transfer rate: 231.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 5 15 9.8 12 82
Processing: 1 9 9.0 6 130
Waiting: 0 8 8.9 6 129
Total: 7 23 14.4 19 142
Percentage of the requests served within a certain time (ms)
50% 19
66% 24
75% 28
80% 30
90% 40
95% 54
98% 66
99% 79
100% 142 (longest request)
Current sysctl settings that may be relevant:
net.netfilter.nf_conntrack_tcp_be_liberal = 1
net.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_count = 1280
net.ipv4.ip_local_port_range = 27050 65500
I didn't see any conntrack "full" errors. Best I could tell there isn't packet loss. We recently upgraded from 1.14 and didn't notice the issue but I can't say for certain it wasn't there. I believe we will be forced to migrate away from romana soon since it doesn't seem to be maintained anymore and as we upgrade to kube 1.16.x we are encountering problems with it starting up.
I have searched the internet all day today looking for similar problems, and the closest one that resembles ours is https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02. However, I have no idea how to implement the iptables masquerade --random-fully option given that we use romana, and I read (https://github.com/kubernetes/kubernetes/pull/78547#issuecomment-527578153) that random-fully is the default for linux kernel 5, which we are using. Any ideas?
kubernetes 1.15.5
romana 2.0.2
centos7
Linux kube-master01 5.0.7-1.el7.elrepo.x86_64 #1 SMP Fri Apr 5 18:07:52 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
====== Update Nov 5, 2019 ======
It has been suggested to test an alternate CNI. I chose calico since we used that in an older Debian based kube cluster. I rebuilt a VM with our most basic Centos 7 template (vSphere) so there is a little baggage coming from our customizations. I can't list everything we customized in our template but the most notable change is the kernel 5 upgrade yum --enablerepo=elrepo-kernel -y install kernel-ml.
After starting up the VM these are the minimal steps to install kubernetes and run the test:
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum -y install docker-ce-3:18.09.6-3.el7.x86_64
systemctl start docker
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
# Set SELinux in permissive mode (effectively disabling it)
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
yum install -y kubeadm-1.15.5-0 kubelet-1.15.5-0 kubectl-1.15.5-0
systemctl enable --now kubelet
kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
cat <<EOF > /tmp/test-deploy.yml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  replicas: 1
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: nginx
        image: nginxdemos/hello
        ports:
        - containerPort: 80
EOF
# wait for control plane to become healthy
kubectl apply -f /tmp/test-deploy.yml
Now the setup is ready and this is the ab test:
$ docker run --rm jordi/ab -n 100 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 11 requests completed
The ab test gives up after this error. If I decrease the number of requests to avoid the timeout, this is what you would see:
$ docker run --rm jordi/ab -n 10 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient).....done
Server Software: nginx/1.13.8
Server Hostname: 192.168.4.4
Server Port: 80
Document Path: /
Document Length: 7227 bytes
Concurrency Level: 1
Time taken for tests: 0.029 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 74140 bytes
HTML transferred: 72270 bytes
Requests per second: 342.18 [#/sec] (mean)
Time per request: 2.922 [ms] (mean)
Time per request: 2.922 [ms] (mean, across all concurrent requests)
Transfer rate: 2477.50 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.8 1 3
Processing: 1 2 1.2 1 4
Waiting: 0 1 1.3 0 4
Total: 1 3 1.4 3 5
Percentage of the requests served within a certain time (ms)
50% 3
66% 3
75% 4
80% 5
90% 5
95% 5
98% 5
99% 5
100% 5 (longest request)
This issue is technically different than the original issue I reported but this is a different CNI and there are still network issues. It does have the timeout error in common when I run the same test in the kube/romana cluster: run the ab test on the same node as the pod. Both encountered the same timeout error but in romana I could get a few thousand requests to finish before hitting the timeout. Calico encounters the timeout error before reaching a dozen requests.
Other variants or notes:
- net.netfilter.nf_conntrack_tcp_be_liberal=0/1 doesn't seem to make a difference
- higher -n values sometimes work but it is largely random.
- running the 'ab' test at low -n values several times in a row can sometimes trigger the timeout
At this point I am pretty sure it is some issue with our centos installation but I can't even guess what it could be. Are there any other limits, sysctl or other configs that could cause this?
====== Update Nov 6, 2019 ======
I observed that we had an older kernel installed, so I upgraded my kube/calico test VM to the same newer kernel, 5.3.8-1.el7.elrepo.x86_64. After the update and a few reboots I can no longer reproduce the "apr_pollset_poll: The timeout specified has expired (70007)" timeout errors.
Now that the timeout error is gone I was able to repeat the original test where I load test pod A and kill pod B on my vSphere VMs. On the romana environments the problem still existed but only when the load test is on a different host than where the pod A is located. If I run the test on the same host, no errors at all. Using Calico instead of romana, there are no load test errors on either host so the problem was gone. There may still be some setting to tweak that can help romana but I think this is "strike 3" for romana so I will start transitioning a full environment to Calico and do some acceptance testing there to ensure there are no hidden gotchas.
You mentioned that if there are no pod changes while the ab test is running, then no errors get reported. So it means that errors occur when you add pod or delete one.
This is normal behaviour: when a pod gets deleted, it takes time for the iptables rule changes to propagate. The container may already have been removed while the iptables rules have not changed yet, so packets are still forwarded to the nonexistent container, and this causes errors (it is a sort of race condition).
The first thing you can do is always define a readiness probe, as it makes sure that traffic is not forwarded to the container until it is ready to handle requests.
The second thing to do is to handle deleting the container properly. This is a bit harder task because it may be handled at many levels, but the easiest thing you can do is adding PreStop hook to your container like this:
lifecycle:
  preStop:
    exec:
      command:
      - sh
      - -c
      - "sleep 5"
The PreStop hook is executed at the moment of the pod deletion request. From this moment, k8s starts changing the iptables rules and should stop forwarding new traffic to the container that is about to be deleted. The sleep gives k8s some time to propagate the iptables changes through the cluster without interrupting already existing connections. After the PreStop handler exits, the container receives the SIGTERM signal.
My suggestion would be to apply both of these mechanisms together and check if it helps.
You also mentioned that bypassing ingress is causing more errors. I would assume that this is due to the fact that ingress has implemented retries mechanism. If it's unable to open a connection to a container, it will try several times, and hopefully will get to a container that can handle its request.
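The retry behaviour described above can be sketched on the client side as well. This is a minimal Python illustration of the idea, not the code nginx-ingress actually runs; `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(operation, attempts=3, delay_seconds=0.0):
    """Call `operation`, retrying when the connection is refused or reset.

    ConnectionRefusedError and ConnectionResetError are both subclasses
    of ConnectionError, so they cover the two HTTP errors ab reported.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            time.sleep(delay_seconds)
    raise last_error


# Stand-in for a request that fails twice while the rules are still
# propagating and then succeeds once traffic reaches a live pod.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionResetError("Connection reset by peer")
    return "200 OK"


print(call_with_retries(flaky_request, attempts=3))
```

A client that talks to pod IPs directly, as ab does here, has no such layer, which would be consistent with seeing more errors when ingress is bypassed.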

Solaris svcs command shows wrong status

I have freshly installed an application on Solaris 5.10. When I check with ps -ef | grep hyperic | grep agent, the processes are up and running. When I check the status with the svcs hyperic-agent command, the output shows that the agent is in maintenance mode. The application is working fine and I don't have any issues with it. Please help.
There are several possible reasons for that behavior:
The starter (the start/exec property of the service) returned a status other than SMF_EXIT_OK (zero). Then you can check the logs:
# svcs -x ssh
...
See: /var/svc/log/network-ssh:default.log
If you check the logs, you may see messages like the following, which mean that the starter script failed or is written incorrectly:
[ Aug 11 18:40:30 Method "start" exited with status 96 ]
Another reason for such behavior is that the service faults while it is running (i.e. one of its processes dumps core or receives a kill signal, or all processes exit), as described here: https://blogs.oracle.com/lianep/entry/smf_5_fault_retry_models
The actual system that provides SMF with monitoring facilities is System Contracts. You can determine the contract ID of an online service with svcs -v (field CTID):
# svcs -vp svc:/network/smtp:sendmail
STATE NSTATE STIME CTID FMRI
online - Apr_14 68 svc:/network/smtp:sendmail
Apr_14 1679 sendmail
Apr_14 1681 sendmail
Then watch events with ctwatch:
# ctwatch 68
CTID EVID CRIT ACK CTTYPE SUMMARY
68 28 crit no process contract empty
Then there are two options to handle that:
There is a real problem with the service, so it eventually faults. Then debug the application.
It is normal behavior for the service, so you should edit and re-import your service manifest to make SMF less paranoid, i.e. configure the ignore_error and duration properties.

Can't obtain database connection within 5 seconds - Sidekiq workers, Unicorn, Redis ToGo on Heroku

I have a bunch of background tasks (Sidekiq workers) that update the database, and I keep getting this exception from failing threads.
Heroku Log: WARN: could not obtain a database connection within 5.000 seconds (waited 5.001 seconds)
Heroku
Heroku Postgres :: Olive - 20 connections limit.
Redis ToGo - 10 connections limit.
Sidekiq - 2 connections.
Each client request creates ~50 threads, of which ~20 end up trying to update the db.
Now I know this is too many threads trying to make a connection (updating ActiveRecord...).
I don't mind them waiting and trying again until they succeed.
-config/unicorn.rb
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 30
preload_app true
before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end
end
after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end
  if defined?(ActiveRecord::Base)
    config = ActiveRecord::Base.configurations[Rails.env] ||
      Rails.application.config.database_configuration[Rails.env]
    config['pool'] = ENV['DB_POOL'] || 2
    config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 # seconds
    ActiveRecord::Base.establish_connection(config)
  end
end
-config/initializers/sidekiq.rb
require 'sidekiq'
Sidekiq.configure_server do |config|
  if(database_url = ENV['DATABASE_URL'])
    p pool_size = Sidekiq.options[:concurrency] + 2
    p ENV['DATABASE_URL'] = "#{database_url}?pool=#{pool_size}"
    ActiveRecord::Base.establish_connection
  end
end
-config/sidekiq.yml
:concurrency: 2
Thanks a lot for all the help,
Eldar
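For reference, the connection budget implied by the settings above can be worked out with a few lines of arithmetic. This Python sketch assumes a single web dyno and a single Sidekiq process, which the question does not state; the function name is illustrative.

```python
def total_db_connections(web_workers, web_pool, sidekiq_processes,
                         sidekiq_concurrency, sidekiq_extra=2):
    """Upper bound on Postgres connections the app can open.

    Each Unicorn worker gets its own pool (config/unicorn.rb), and each
    Sidekiq process gets a pool of concurrency + 2 (the initializer).
    """
    web = web_workers * web_pool
    workers = sidekiq_processes * (sidekiq_concurrency + sidekiq_extra)
    return web + workers


# WEB_CONCURRENCY defaults to 3, DB_POOL defaults to 2, :concurrency: is 2.
PLAN_LIMIT = 20  # connection limit quoted for the Heroku Postgres plan
needed = total_db_connections(web_workers=3, web_pool=2,
                              sidekiq_processes=1, sidekiq_concurrency=2)
print(f"up to {needed} connections of {PLAN_LIMIT} allowed")
```

Under these assumptions the plan limit is not exhausted. The warning reports a wait for a connection, so it may be worth comparing the number of threads per process (about 20 database writers per request) with that process's pool size (2 or 4) rather than with the plan limit.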

app on bluemix does not start up says 0 of 1 instance running

My app on Bluemix does not start up; it says 0 of 1 instances running. How do I fix it?
Starting app mytwitinfluapp in org xyz#in.ibm.com / space dev as xyz#in.ibm.com...
OK
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 down
0 of 1 instances running, 1 failing
FAILED
Start unsuccessful
logs indicate the following
2014-08-25T12:37:38.31+0530 [DEA] OUT Instance (index 0) failed to start accepting connections
2014-08-25T12:38:06.79+0530 [DEA] OUT Removing crash for app with id e7c454db-1d71-486d-ae8c-1fce17b978ec
2014-08-25T12:38:06.79+0530 [DEA] OUT Stopping app instance (index 0) with guid e7c454db-1d71-486d-ae8c-1fce17b978ec
2014-08-25T12:38:06.79+0530 [DEA] OUT Stopped app instance (index 0) with guid e7c454db-1d71-486d-ae8c-1fce17b978ec
2014-08-25T12:42:46.15+0530 [DEA] OUT Removing crash for app with id e7c454db-1d71-486d-ae8c-1fce17b978ec
2014-08-25T12:42:46.15+0530 [DEA] OUT Stopping app instance (index 0) with guid e7c454db-1d71-486d-ae8c-1fce17b978ec
2014-08-25T12:42:46.15+0530 [DEA] OUT Stopped app instance (index 0) with guid e7c454db-1d71-486d-ae8c-1fce17b978ec
The key error message in the log output is this one:
2014-08-25T12:37:38.31+0530 [DEA] OUT Instance (index 0) failed to start accepting connections
The message means that your app is either listening on the wrong port (as jsloyer stated) or you have pushed an app that doesn't listen on a port but you didn't set the --no-route option.
What is happening under the covers is that the health manager polls the route (URL) that has been configured for your app to determine whether the application is still alive. As no response came back from your app, it killed that instance of the app.
This usually indicates an error in your code (app.js, manifest.yml, or the equivalent in another language). I found the following cf command very useful for debugging this situation:
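As an illustration of listening on the expected port, here is a minimal Python sketch. It assumes the platform passes the port in the PORT environment variable, as Cloud Foundry does; the handler, the fallback port 8080 and the function names are made up for the example and are not taken from the asker's app.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def resolve_port(environ, default=8080):
    """Use the port assigned by the platform, else a local fallback."""
    return int(environ.get("PORT", default))


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET so the health check gets a response."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main():
    # Bind to all interfaces on the assigned port; a hard-coded port
    # here is one way to end up with "failed to start accepting connections".
    port = resolve_port(os.environ)
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `main()` starts the server. The same pattern applies in other runtimes: read the assigned port from the environment instead of fixing it in code.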
cf logs app-name --recent
This will dump the logs when you try to push your app onto Bluemix. You can use the above command also if your app crashes suddenly while executing.
Check out my post Spicing up IBM Bluemix cloud app with MongoDB and NodeExpress
I used this technique many times while debugging my app
This is usually caused by a runtime error, anything from binding to the wrong port to an error in the startup of your app. As someone mentioned above, can you please post the output of cf logs mytwitinfluapp --recent?
The info you have provided is insufficient for a diagnosis. I assume that you are using the cf command prompt. For a detailed analysis of the failure, please provide the log file of a recent run.
If you are using something like Java, verify that your MANIFEST.MF file is correct for the Bluemix domain. If you are using a runtime other than Java, verify that your app has been configured correctly to point to your Bluemix environment.
This article helped me understand how to deploy applications in Bluemix:
http://www.ibm.com/developerworks/java/library/j-hangman-app/index.html