Port exhaustion and random process - sockets

I am investigating a network problem on a Windows Server 2016.
Around once a week, all outgoing ports are used up, which means that various network components no longer function properly. When this happens, I find these warnings in the Windows Event Log:
TCP/IP warning 4231:
"A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use."
TCP/IP warning 4227:
"TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint."
That looked like a typical handle/socket leak to me, so I tried to find the process that owns the connections with "netstat -anobq".
Here is a short excerpt of the connections in WAIT state:
  ...
  TCP 192.168.24.40:49814 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:49833 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:49880 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:50167 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:50185 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:50236 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:50713 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:50718 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:50725 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:50798 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:50837 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:50887 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:51308 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:51336 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:51360 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:51380 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:51427 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:51487 192.168.24.40:49670 WAIT 0
 [Explorer.exe]
There was a huge list (>1000) of connections in WAIT state. In this excerpt the process appears to be explorer.exe, but if I run the same command a few minutes later, the process is a different one. I have caught Firefox.exe, sshd, the Windows Telemetry Service and many other processes, each with a huge list of WAIT connections.
The second strange thing is that 90% of these connections point to 192.168.24.10:5432. A Postgres database runs on port 5432 on this server, but Firefox, Explorer.exe and more than 10 other processes do not access this database.
It looks as if netstat is wrong and the connections actually belong to another process. Is that even possible?
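A cross-check I still want to run is to group the connections by owning PID with PowerShell instead of relying on netstat's name resolution - roughly something like this (an untested sketch; the PID at the end is a placeholder):

  # Count TCP connections per owning process, largest first
  Get-NetTCPConnection |
      Group-Object -Property OwningProcess |
      Sort-Object -Property Count -Descending |
      Select-Object -First 10 Count, Name

  # Resolve one of the PIDs (the "Name" column above) to a process name
  Get-Process -Id 1234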
Windows Defender is running on this server, and I additionally ran a scan with Panda Antivirus. The server appears to be clean.
I could lower the wait timeout so the connections are released earlier, or increase the ephemeral port range for outgoing connections (currently ~16,000 ports), but I think that would only postpone the problem by a few days.
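For reference, these are the two knobs I would be adjusting - shown as a sketch, with the registry path written from memory and the numbers only examples, not recommendations:

  # Inspect / enlarge the ephemeral (dynamic) port range
  netsh int ipv4 show dynamicport tcp
  netsh int ipv4 set dynamicport tcp start=10000 num=55535

  # Shorten how long closed connections linger in TIME_WAIT (value is in seconds)
  Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters' `
      -Name TcpTimedWaitDelay -Type DWord -Value 30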
Do you have any advice on what to check next?

Related

What does "Pool is now shutting down as requested" mean when using host connection pools

I have a few streams that wake up every minute or so, pull some documents from the DB, perform some actions, and finally send messages to SNS.
The tick interval is currently 1 minute.
Every few minutes I see these INFO messages in the log:
[INFO] [06/04/2020 07:50:32.326] [default-akka.actor.default-dispatcher-5] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool is now shutting down as requested.
[INFO] [06/04/2020 07:51:32.666] [default-akka.actor.default-dispatcher-15] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool shutting down because akka.http.host-connection-pool.idle-timeout triggered after 30 seconds.
What does it mean? Has anyone seen this before? The 443 worried me.
Akka HTTP host connection pools are terminated automatically by Akka if they are not used for a certain time (the default is 30 seconds). This can be configured and set to infinite if needed.
The pools are re-created on the next use, but this takes some time, so the request that triggers the re-creation will be "blocked" until the pool is ready.
From the documentation:
The time after which an idle connection pool (without pending requests) will automatically terminate itself. Set to infinite to completely disable idle timeouts.
The config parameter that controls it is:
akka.http.host-connection-pool.idle-timeout
The log message points to the config parameter too:
Pool shutting down because akka.http.host-connection-pool.idle-timeout triggered after 30 seconds.
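For example, in application.conf this can be set like so (a sketch - use a finite value such as 600s instead of infinite if you don't want to keep the pool around forever):

  # keep the shared host connection pool alive between ticks instead of letting it idle out
  akka.http.host-connection-pool.idle-timeout = infinite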

Too many connections on ZooKeeper server

Environment: HDP 2.6.4
Ambari – 2.6.1
3 ZooKeeper servers
23.1.35.185 is the IP of the first ZooKeeper server
Hi all,
On the first ZooKeeper server it seems that connections are not actually closed even after the client closes its connection to ZooKeeper,
which causes the maximum number of client connections from a single host to be reached - we have maxClientCnxns set to 60 in the ZooKeeper config.
As a result, when a new application comes along and tries to create a connection, it fails.
Example output when the connection count is high:
echo stat | nc 23.1.35.185 2181
Latency min/avg/max: 0/71/399
Received: 3031 Sent: 2407
Connections: 67
Outstanding: 622
Zxid: 0x130000004d
Mode: follower
Node count: 3730
But after some time, when the connection count gets to ~70, we see:
echo stat | nc 23.1.35.185 2181
Ncat: Connection reset by peer.
We can also see many connections in CLOSE_WAIT:
java 58936 zookeeper 60u IPv6 381963738 0t0 TCP Zookeper_server.sys54.com:eforward->zookeper_server.sys54.com:44983 (CLOSE_WAIT)
From the ZooKeeper log:
2018-12-26 02:50:46,382 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#193]
- Too many connections from /23.1.35.185 - max is 60
In Ambari we can also see:
Connection failed: [Errno 104] Connection reset by peer to zookeper_server.sys54.com.:2181
I must say that this is not happening on ZooKeeper servers 2 and 3.
NOTE - if we increase maxClientCnxns to 300, it does not help, because after some time we get more than 300 connections (CLOSE_WAIT) and then we see in the log:
2018-12-26 02:50:49,375 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#193] - Too many connections from /23.1.35.187 - max is 300
So, any hint as to why the connections are stuck in CLOSE_WAIT?
CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.
The problem is your program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (and quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open.
Once the local program closes the socket, the OS can send the FIN to the remote end which transitions you to LAST_ACK while you wait for the ACK of the FIN. Once that is received, the connection is finished and drops from the connection table (if your end is in CLOSE_WAIT you do not end up in the TIME_WAIT state).
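If the CLOSE_WAIT sockets belong to one of your own applications, the fix is to make sure every handle is closed on all code paths. A minimal sketch of the pattern, assuming a client that uses the plain ZooKeeper Java API (the connect string and session timeout are placeholders):

  import org.apache.zookeeper.ZooKeeper;

  public class ZkClientExample {
      public static void main(String[] args) throws Exception {
          ZooKeeper zk = new ZooKeeper("23.1.35.185:2181", 30000, event -> { });
          try {
              // ... read/write znodes here ...
          } finally {
              // An end that never calls close() is exactly what lingers in CLOSE_WAIT
              // after the peer has hung up.
              zk.close();
          }
      }
  }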
There is also a kernel-level property to reuse connections and reduce the TIME_WAIT time.
I suggest you follow this tutorial: http://www.linuxbrigade.com/reduce-time_wait-socket-connections/
This should probably solve your problem.

HAProxy backend stays down and is never brought up again

It works fine up until the moment the remote server becomes unavailable for some time, at which point the server is marked down in the logs and is never brought back up. The config is quite simple:
defaults
    retries 3
    timeout connect 5000
    timeout client 3600000
    timeout server 3600000
    log global
    option log-health-checks

listen amazon_ses
    bind 127.0.0.2:1234
    mode tcp
    no option http-server-close
    default_backend bk_amazon_ses

backend bk_amazon_ses
    mode tcp
    no option http-server-close
    server amazon email-smtp.us-west-2.amazonaws.com:587 check inter 30s fall 120 rise 1
Here are the logs when the problem occurs:
Jul 3 06:45:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30004ms, status: 119/120 UP.
Jul 3 06:46:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30003ms, status: 118/120 UP.
Jul 3 06:47:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30002ms, status: 117/120 UP.
...
Jul 3 08:44:36 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30000ms, status: 0/1 DOWN.
Jul 3 08:44:36 jupiter haproxy[40331]: Server bk_amazon_ses/amazon is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 3 08:44:36 jupiter haproxy[40331]: backend bk_amazon_ses has no server available!
And that's it - nothing but operator intervention brings the server back to life. I also tried removing the check part and everything after it; the same thing still happens. Can't HAProxy be configured to keep trying indefinitely and never mark the server DOWN? Thanks.
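For reference, the timing in the logs matches the check settings: the failed checks land about a minute apart (presumably the 30-second Layer4 check timeout plus inter 30s), so

  120 (fall) x ~60 s per failed check ≈ 2 hours

which is exactly the gap between the first failure at 06:45 and the DOWN transition at 08:44.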

Logs being flooded with tsa.connection.tcpip-forward.forward-conn.io-complete

We are being flooded with these messages in the ATC logs - over 2000 messages per second.
{"timestamp":"1488987647.707217455","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.closing-forwarded-tcpip","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75620"}}
{"timestamp":"1488987647.707637787","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.waiting-for-tcpip-io","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75621"}}
{"timestamp":"1488987647.735419750","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.io-complete","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75621"}}
{"timestamp":"1 488987647.735453606","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.done-waiting","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75621"}}
We are running 8 workers on Concourse 2.7.0, and we have over 9000 connections in TIME_WAIT.
Any ideas what is happening and why?

What happens when a TCP keepalive probe fails on an established socket?

To be more clear, I am wondering what TCP state a socket would be in after the keepalive probes have failed too many times.
For example, right now I have the following entry in netstat -anop:
tcp 0 0 10.10.10.10:12345 11.11.11.11:56789 ESTABLISHED 12345/process keepalive (7200.00/0/0)
Let's say host 11.11.11.11 suddenly loses power forever; host 10.10.10.10's keepalive probes will eventually detect the broken connection. When 10.10.10.10 detects it, what state will the socket be in, as shown by netstat?
Once the configured number of keepalive probes has gone unanswered, the connection will be reset: any blocked read or write on the socket fails (on Linux typically with ETIMEDOUT), and the line will disappear from netstat.
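To observe this behaviour, keepalive first has to be enabled on the socket. A minimal sketch in Java - the peer address mirrors the example above, and the actual probe timings come from the OS (e.g. tcp_keepalive_time = 7200 s, as in the netstat output):

  import java.io.IOException;
  import java.io.InputStream;
  import java.net.Socket;

  public class KeepaliveExample {
      public static void main(String[] args) throws IOException {
          Socket socket = new Socket("11.11.11.11", 56789);
          socket.setKeepAlive(true); // ask the OS to send keepalive probes on this connection

          InputStream in = socket.getInputStream();
          try {
              in.read(); // blocks; once the probes go unanswered the kernel aborts the connection
          } catch (IOException e) {
              // On Linux the abort typically surfaces as ETIMEDOUT ("Connection timed out");
              // by then the entry is gone from netstat.
              System.out.println("Connection aborted: " + e.getMessage());
          }
      }
  }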