Nagios: Host goes Down -> Up after that all Services seems to timeout


I have a strange problem with Nagios. After a restart everything runs perfectly fine.
Then, some hours later, hosts are shown as down and a minute later as up again (see the history log below). After that, all services fail with a timeout.
This doesn't happen with all servers at the same time. Which server fails seems to be random.
History log:
[2013-06-26 19:19:07] SERVICE ALERT: HyperV 1;Check CPU HyperV 1;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 120 seconds.
[2013-06-26 19:17:27] HOST ALERT: HyperV 1;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 3.01 ms
[2013-06-26 19:16:17] HOST ALERT: HyperV 1;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
What I have tried so far:
-Increased the timeouts
-Changed the host check so that the host is checked more often before it is marked as failed (5 times instead of 1)
-Executed the scripts from the command line -> they also fail (maybe an Ubuntu problem?)
-Checked the logs on both sides for errors (nothing found)
After a restart everything is fine again.
System info:
-Nagios is running on Ubuntu 13.04
-Some clients are running different Windows versions with NSClient++
-ESX with versions from 4.0 to 5.1
Plugins:
-check_nrpe
-check_vmfs from Nagios Exchange
If something is unclear, don't hesitate to ask.
Thanks & best,
Pille

You seem to have a networking issue, not a Nagios issue. It could be a bad cable, a failing NIC, a routing problem, switch flapping, an ARP table overflow, or any number of other things.
Since this affects all hosts/services, and intermittently, and clears itself up, I would suggest you start looking for a problem on your local connections first. If it only affects some items and not others, then find which hosts have common network components and check there.
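To narrow down where the network drops out, it can help to log reachability independently of Nagios and compare the timestamps with the history log. Below is a minimal Python sketch of that idea. The function names are made up for illustration, and it only tests whether a TCP connection can be opened, not whether NRPE itself answers correctly.

```python
import socket
import time
from datetime import datetime


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def state_changes(samples):
    """Reduce (timestamp, ok) samples to the points where the state flips."""
    changes = []
    previous = None
    for stamp, ok in samples:
        if ok != previous:
            changes.append((stamp, "UP" if ok else "DOWN"))
            previous = ok
    return changes


def watch(host, port, interval=10.0, rounds=6):
    """Probe repeatedly and return only the UP/DOWN transitions."""
    samples = []
    for i in range(rounds):
        stamp = datetime.now().isoformat(timespec="seconds")
        samples.append((stamp, tcp_probe(host, port)))
        if i < rounds - 1:
            time.sleep(interval)
    return state_changes(samples)
```

For example, `watch("192.0.2.10", 5666)` would probe a placeholder address on 5666, the usual NRPE port. Running it from the Nagios box and from a second machine on a different switch should show whether the outage is local to the Nagios server's connection.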

Related

Local web server on windows stopped being reachable by devices on the same network

I use a local Python web server on my Windows machine. It’s simple, but good enough while in the static web page development stage. I just run it with something like this on my WSL command line:
python3 -m http.server
I can also access it on mobile devices on the same network, by going to my local address, e.g.: http://192.168.1.12:8000. All was good, until suddenly I could no longer access it on external devices, I got a “server not responding” type of message. Also, I could clearly see that when I refreshed the page on my phone, there was no GET request on the logs.
Immediately I tested on the local machine, and it was still working fine. This obviously smelled like a Firewall. In Linux, I’d know what to do, but it’s the first time I had to deal with this on Windows. This is what I’ve tried, without resolving the connection problem:
I opened the Event Viewer but could not see any obvious logs to check
I stopped the server (CTRL+C) and started it again on another port (5000). The Windows Firewall message popped up again asking for permission for Python3 to access the “Public network” and the “Private network”. Normally I just tick the “private network” but this time I checked both, as a troubleshooting step, in case my Wi-Fi was incorrectly being considered “public”.
I went to Windows Firewall and temporarily shut it down on the private network.
I installed and tried running nmap on the WSL, but it failed to run and prompted me to install the Windows version instead.
I installed and ran the Windows version of nmap but it told me that port 5000 was open.
What is the recommended way to troubleshoot and fix this issue?

Still suspecting the firewall, I tried something new: I switched off the “public network” firewall. I tested on my mobile and the page loaded as normal again! I immediately turned the firewall back on and tested the page on my mobile once more: still fine. So the solution was to toggle the public network firewall. To make it more generic, I would toggle all firewall categories on Windows. And of course, I would make sure that the firewall stays on afterwards; this was a very quick operation.
I thought I’d put this here rather than ServerFault or SuperUser as it could potentially be more useful to developers, and it took a precious hour of my time. I still don’t know why it stopped working on its own in the first place. Better troubleshooting steps or suggestions are welcome, but I probably won’t be able to verify it as I don’t know how to purposely induce the issue.
Another solution, which worked on a different occasion, was to delete all instances of Python 3.8 from the list of allowed apps (I don't know why Windows shows the same app multiple times), then (re)start the Python server and allow it through when the Firewall prompt pops up again.

In Windows Firewall you have four rule types to choose from when you create a new inbound rule for your local web server:
1 Program
2 Port
3 Predefined
4 Custom
Choose Port, select the TCP protocol and enter the custom port.
Allow the connection.
Tick all three profiles: Domain, Private and Public.
Enter a name.
That's all.
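To check whether such a rule actually took effect, you can run a small connection test from another machine on the network. This is only a sketch: the function name is invented for illustration, and reading a timeout as "probably the firewall" is a rule of thumb, not a certainty.

```python
import socket


def classify_port(host, port, timeout=3.0):
    """Try one TCP connection and describe how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all; a firewall silently dropping packets is one common cause.
        return "timeout"
    except OSError as exc:
        # Anything else, for example no route to host or a failed name lookup.
        return "error: " + str(exc)
```

Calling `classify_port("192.168.1.12", 8000)` from a second computer (using the address from the question) and getting "timeout" while the same call on the server itself returns "open" would point towards filtering between the two machines.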

avahi works only few minutes

I use avahi v0.7, compiled with buildroot 2019.02 for an armv7 target.
The compilation works fine.
After the service or the system starts, I can ping my device from another computer with ping wpb-the.local. But after a few minutes (1 or 2), on the same computer, I get ping: wpb-the.local: Name or service not known from Ubuntu, or Ping request could not find host wpb-the.local. Please check the name and try again. from Windows.
On the device side the daemon is still running and there are no errors in messages.
If I stop the service and restart it, it works for a few minutes again.
I don't understand why. Do you have any hints?
Best regards
JM

Tableau Server v2018.2 refusing to use port 80 despite it being open

I have a Windows Server 2016 machine that used to run Tableau Server v2018.1 (and a few versions before that). During this last update, I performed a backup and then proceeded to wipe Tableau off the server (I used the tableau-obliterate script, which removed all things Tableau).
I then proceeded to install Tableau v2018.2 as a clean install, set up the configuration to use port 80 and started the server successfully.
However, I quickly discovered that Tableau had moved the gateway to port 8000. I reviewed the ports to make sure nothing else was using port 80 (this VM has nothing other than Tableau installed on it). I used TCPView and monitored the ports while Tableau Server was running and while stopping/starting it. The only hint I found of something touching port 80 was the output of netstat, which showed a TCP entry for vizqlserver.exe in the CLOSE_WAIT state.
I have tried manually setting the port through TSM configuration (run set, confirm with get, restart), TSM Settings import, and manually adjusting the configuration file for gateway, but Tableau just reverts back to port 8000.
I am at a loss as to why this is happening as again, nothing else has ever been on this server and nothing has changed since removing v2018.1 (which was running on port 80).
I tried to post this on the Tableau community forum, but 20 hrs later, it is still pending moderator approval :(
Would appreciate any help!

A recent Windows update has been causing some port conflicts. Try this:
https://kb.tableau.com/articles/Issue/kb4338818-windows-update-causing-tableau-server-to-become-unstable

Zend project not working on XAMPP

I have just created a Zend project on my local machine, but when I try to run it in the browser, it just loads for at least a minute and then shows this error:
Fatal error: Maximum execution time of 30 seconds exceeded in /opt/lampp/htdocs/launchmind/library/Zend/Db/Adapter/Abstract.php on line 815
It displays the same error with some other line number on some other file every time I reload the page.
Please help. Thank you.

Most likely you have a firewall issue that causes the script to time out. Check the connections from your local machine to all the database hosts in the config and/or web services and/or other servers involved.
Here are a few tips:
Monitor the XAMPP error_log and /var/log/messages.
Since it's a local server, disable the firewall temporarily to see whether the problem is your local firewall (blocking outgoing connections) or the remote server's firewall (blocking incoming connections). On RHEL use sudo service iptables stop or /etc/init.d/iptables stop
If you have very time-consuming tasks in your app (very unlikely), you can try to bump up the max execution time, either by modifying /opt/lampp/etc/php.ini (somewhere there), by setting phpSettings.max_execution_time = 60 in your app config, or by calling ini_set in the bootstrap.
Hope this helps.

How do I know if a system has powered on?

I am writing a script that powers on a system via the network, and then I need to run a few commands on that host. How do I know whether the system has powered on?
My programming language is Perl and the target host is RHEL5.
Is there any kernel interrupt or network boot information that indicates the system has powered on and the OS has loaded?
[In a different scenario] I was also wondering: if I just switch on my machine manually, when exactly is it said to have powered on? And when is the OS considered to have booted completely enough for a network operation, such as executing a command on it over the network? And if the system is on DHCP, how would a remote system find this machine? [I guess it is possible via the MAC address, but I may be wrong.]
If I have missed out any info, please feel free to ask me. If you have any suggestions to make the task easier, please share them :)
Thanks,
imkin

Well, I'd say the system is booted when it can perform the request you've made of it. That is, the sshd daemon is running. That's booted sufficiently for your purposes (I assume - substitute for whatever daemon you really need).
So, I'd send the power-on signal, and check back every 15-30 seconds to see if I could connect. If I've failed to connect within whatever is a reasonable time for that machine (2 minutes or 5 minutes or whatever), then I'd send an alert to the IT support team. Well, I'd send it to myself first, and only once I've investigated a few failures or so and found them to all be legitimate would I start sending it directly to IT.
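The poll-and-give-up loop described above could look roughly like this. It is a Python sketch rather than Perl, the function name and default values are illustrative, and it only confirms that the port accepts a TCP connection, not that sshd will let you log in.

```python
import socket
import time


def wait_for_port(host, port=22, interval=15.0, deadline=300.0, connect_timeout=5.0):
    """Poll host:port until a TCP connection succeeds or the deadline passes.

    Returns True once the port accepts a connection, False on giving up.
    """
    give_up_at = time.monotonic() + deadline
    while True:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            pass  # not reachable yet; keep waiting
        # Stop if the next attempt would start after the deadline.
        if time.monotonic() + interval > give_up_at:
            return False
        time.sleep(interval)
```

After sending the power-on signal, something like `wait_for_port("192.0.2.20", 22)` (a placeholder address) would return True as soon as the machine accepts connections on port 22, and a False result is the point where you would send the alert.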
DHCP is kind of a different question. You'd have to start learning about broadcasting, or having a daemon on that machine "call home" during boot to register its current IP address. And it would have to "call home" every time a DHCP renewal changed its IP address. This is decidedly more convoluted. Try to avoid DHCP on such server machines if at all possible.

On the rebooting machine you can install a script in your crontab with the special @reboot schedule (see man 5 crontab). That script could send a notification of some kind to the other machine, telling it that it's up now.
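As a rough illustration of such a notification, here is a Python sketch in which the booted machine sends one UDP datagram and the controlling machine waits for it. All function names, the message and the port handling are assumptions made for the example. UDP gives no delivery guarantee, and a job started by cron at boot may run before the network is up, so a real script would probably retry a few times.

```python
import socket


def announce_boot(controller_host, controller_port, message=b"up"):
    """Run on the booted machine: send one UDP datagram to the controller."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (controller_host, controller_port))


def open_listener(bind_host="0.0.0.0", port=0):
    """Run on the controller: open a UDP socket (port 0 lets the OS pick one)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_host, port))
    return sock


def wait_for_announcement(sock, timeout=300.0):
    """Block until a datagram arrives; return (sender_ip, payload) or None on timeout."""
    sock.settimeout(timeout)
    try:
        payload, (sender_ip, _sender_port) = sock.recvfrom(1024)
    except socket.timeout:
        return None
    return sender_ip, payload
```

A side benefit is that the controller learns the sender's current IP address from the datagram, which also helps when the machine gets its address via DHCP.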

I think checking for sshd sounds like a good approach.
As for the DHCP problem: if the other computer is on the same subnet you can look it up by MAC address using Net::ARP.
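For comparison, here is a different way to do a MAC-to-IP lookup without Net::ARP: on a Linux machine the kernel's ARP cache can be read from /proc/net/arp. The sketch below is in Python, the function names are made up, and it only finds hosts on the same subnet that the local machine has exchanged traffic with recently.

```python
def find_ip_by_mac(arp_table_text, mac):
    """Return the IP address listed for mac in /proc/net/arp style text, or None."""
    wanted = mac.lower()
    lines = arp_table_text.splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        # Columns: IP address, HW type, Flags, HW address, Mask, Device
        if len(fields) >= 4 and fields[3].lower() == wanted:
            return fields[0]
    return None


def lookup_ip(mac, path="/proc/net/arp"):
    """Read the local ARP cache (Linux only) and look up mac in it."""
    with open(path) as handle:
        return find_ip_by_mac(handle.read(), mac)
```

If the entry is missing, pinging the subnet's broadcast address or the expected address range first may populate the cache, though whether hosts answer depends on their configuration.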

How about adding a script to the remote machine that runs on startup and tells you when the machine is ready?
