Logs being flooded with tsa.connection.tcpip-forward.forward-conn.io-complete - concourse

We are being flooded by these messages in the ATC logs, at over 2000 messages per second:
{"timestamp":"1488987647.707217455","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.closing-forwarded-tcpip","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75620"}}
{"timestamp":"1488987647.707637787","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.waiting-for-tcpip-io","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75621"}}
{"timestamp":"1488987647.735419750","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.io-complete","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75621"}}
{"timestamp":"1488987647.735453606","source":"tsa","message":"tsa.connection.tcpip-forward.forward-conn.done-waiting","log_level":0,"data":{"remote":"172.16.4.61:55834","session":"6.3.75621"}}
We are running 8 workers on Concourse 2.7.0, and we have over 9000 connections in TIME_WAIT.
Any idea what is happening, and why?
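To put a number on the flood, the log lines can be tallied per second and per message name. This is only a minimal sketch: it assumes each log line is one JSON object with the "timestamp" and "message" fields seen in the excerpt above, and the function name tally_messages is made up for illustration.

```python
import json
from collections import Counter


def tally_messages(lines):
    """Count log entries per (whole second, message name)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
            # Timestamps in the excerpt are strings such as "1488987647.707217455".
            second = int(float(entry["timestamp"]))
            counts[(second, entry["message"])] += 1
        except (ValueError, KeyError, TypeError):
            # Skip anything that is not a well-formed log entry.
            continue
    return counts
```

Feeding it the lines of a log file and calling most_common(10) on the result would show which message names dominate in which second.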

Related

Can the syslog PRI value be negative?

First, my architecture:
client--->haproxy--->syslog-ng--->kafka
The client is a Cisco ASA, haproxy is the load-balancing server, and syslog-ng receives, filters, and sends the logs to kafka (the destination).
The client sends logs to haproxy, and haproxy forwards them to syslog-ng over TCP.
With TCP, the client-server connection times out and breaks. Whenever the client restores the connection, the PRI value of the message is negative, as we can see in Wireshark. Because of this, the messages get mixed up.
Restoring the connection is normal, but a negative PRI value is incorrect.
Here are the logs:
<-1>May 24 2021 17:40:28: %ASA--1-6414004: TCP Syslog Server private:xx.xx.xx.xx/1470 -
Connection restored\\nCAL\\\\John Mike/xxxxxxxxxxxxxxxxxx) to private:xx.xx.xx.xx/xx duration 0:00:00 bytes 142
(John Mike/xxxxxxxxxxxxxxxxxx)\\nxxxxxxx)\\n4 2021 17:40:28: %ASA-6-302016: Teardown UDP connection 1733810491
We've increased the client connection timeout from 1 minute to 12 hours, but the problem is not resolved.
Some versions of the Cisco ASA TCP syslog code are affected by bug CSCvz85683:
Symptom:
Wrong syslog message format, ex for 414004:
-1>Sep 08 2021 10:46:25: %ASA--1-6414004: TCP Syslog Server private:xx.xx.xx.xx/1470 - Connection restored\n (xx.xx.xx.xx/64437)
Conditions:
External logging to TCP server is enabled
Workaround:
NA
Further Problem Description:
ASA syslog messages have 6-digit ID
The valid range for message IDs is between 100000 and 999999.
Source: Cisco ASA Series Syslog Messages. About ASA Syslog Messages.
When logging via TCP, versions with the defect shift the priority (6 in this case) into the message code (414004 in this case) and use an illegal priority of -1.
According to the bug, this has been fixed in version 9.14.4.
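For reference, a syslog PRI is facility * 8 + severity, so valid values run from 0 to 191 (RFC 3164 and RFC 5424); a negative value can never be legal. Until the ASA is upgraded, a receiver-side check along these lines could flag the broken messages. This is a rough sketch, not a syslog-ng feature; the name check_pri and the returned status strings are made up for illustration.

```python
import re

# A PRI header is "<" + digits + ">" at the very start of the message.
# The optional minus sign is matched only so that the bad value can be reported.
PRI_RE = re.compile(r"^<(-?\d+)>")


def check_pri(line):
    """Return ("ok", (facility, severity)), ("invalid", pri) or ("missing", None)."""
    match = PRI_RE.match(line)
    if match is None:
        return ("missing", None)
    pri = int(match.group(1))
    if not 0 <= pri <= 191:
        return ("invalid", pri)
    # PRI = facility * 8 + severity
    return ("ok", divmod(pri, 8))
```

With this, a line starting with <-1> is reported as invalid, while a line starting with <166> decodes to facility 20 and severity 6.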

Port exhaustion and random process

I am investigating a network problem on a Windows Server 2016 machine.
Around once a week, all outgoing ports are used up, which means that various network components no longer function properly. When this happens, these warnings appear in the Windows event log:
TCP / IP warning: 4231
"A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use".
TCP / IP warning: 4227
"TCP / IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP / IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP / IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint ".
That looked like a typical handle / socket leak to me, so I ran "netstat -anobq" to find the process that owns the connections.
This is a very short excerpt of the connections in WAIT state:
  ...
  TCP 192.168.24.40:49814 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:49833 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:49880 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:50167 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:50185 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:50236 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:50713 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:50718 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:50725 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:50798 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:50837 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:50887 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:51308 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:51336 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:51360 192.168.24.40:49661 WAIT 0
  TCP 192.168.24.40:51380 192.168.24.10:5432 WAIT 0
  TCP 192.168.24.40:51427 192.168.24.40:49670 WAIT 0
  TCP 192.168.24.40:51487 192.168.24.40:49670 WAIT 0
 [Explorer.exe]
There was a huge list (> 1000) of connections in WAIT state. In this example the process seems to be explorer.exe, but if I run the same command a few minutes later, a different process is shown. I have seen Firefox.exe, SSHd, the Windows Telemetry Service and many other processes, each with a huge list of WAIT connections.
The second strange thing is that 90% of these connections point to 192.168.24.10:5432. A Postgres DB runs on port 5432 on this server, but Firefox, Explorer.exe and more than 10 of the other processes do not access this DB.
It looks like netstat is wrong and the connections belong to another process. Is this even possible?
Windows Defender is running on this server, and additionally I ran a scan with Panda Antivirus. The server seems to be clean.
I could lower the wait timeout so the connections close earlier, or increase the number of outgoing connections (currently ~ 16000). But I think that would only postpone the problem by a few days.
Do you have any advice on what to check next?
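One way to see where the waiting connections actually go, independent of the process name netstat prints, is to tally them by remote endpoint. The sketch below assumes the column layout shown in the excerpt above (protocol, local address, remote address, state, one trailing column); real netstat output may name the state differently (for example TIME_WAIT), which can be passed as the state argument. The function name is made up for illustration.

```python
from collections import Counter


def tally_remote_endpoints(netstat_text, state="WAIT"):
    """Count TCP connections in the given state per remote address:port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, local endpoint, remote endpoint, state, ...
        if len(fields) >= 4 and fields[0] == "TCP" and fields[3] == state:
            counts[fields[2]] += 1
    return counts
```

Running it on saved netstat output and looking at most_common() would show whether the Postgres endpoint really dominates across several captures.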

Too many connections on zookeeper server

Environment: HDP 2.6.4
Ambari – 2.6.1
3 zookeeper server
23.1.35.185 - is the IP of the first zookeeper server
Hi all,
On the first zookeeper server, it seems that connections to zookeeper are not actually closed even after the client closes them,
which causes the maximum number of client connections from a host to be reached - we have maxClientCnxns set to 60 in the zookeeper config.
As a result, when a new application tries to create a connection, it fails.
Example of the connection count:
echo stat | nc 23.1.35.185 2181
Latency min/avg/max: 0/71/399
Received: 3031 Sent: 2407
Connections: 67
Outstanding: 622
Zxid: 0x130000004d
Mode: follower
Node count: 3730
But after some time, when the connection count reaches ~70, we see:
echo stat | nc 23.1.35.185 2181
Ncat: Connection reset by peer.
We can also see many connections in CLOSE_WAIT:
java 58936 zookeeper 60u IPv6 381963738 0t0 TCP Zookeper_server.sys54.com:eforward->zookeper_server.sys54.com:44983 (CLOSE_WAIT)
From the zookeeper log
2018-12-26 02:50:46,382 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#193]
- Too many connections from /23.1.35.185 - max is 60
In Ambari we can also see:
Connection failed: [Errno 104] Connection reset by peer to zookeper_server.sys54.com.:2181
I must say that this is not happening on zookeeper servers 2 and 3.
NOTE - increasing maxClientCnxns to 300 does not help, because after some time we get more than 300 connections (CLOSE_WAIT) and then the log shows:
2018-12-26 02:50:49,375 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#193] - Too many connections from /23.1.35.187 - max is 300
So, any hint why the connections are in CLOSE_WAIT?
CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.
The problem is that the program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open.
Once the local program closes the socket, the OS can send the FIN to the remote end which transitions you to LAST_ACK while you wait for the ACK of the FIN. Once that is received, the connection is finished and drops from the connection table (if your end is in CLOSE_WAIT you do not end up in the TIME_WAIT state).
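The sequence described above can be illustrated with a small loopback example. It is a generic sketch in Python, not ZooKeeper's code, and the function names are made up: the server side sees the peer's FIN as an empty read, and from that moment its socket sits in CLOSE_WAIT until the program itself calls close().

```python
import socket
import threading


def handle_one_client(listener, result):
    """Accept one connection, read until the peer closes, then close our side."""
    conn, _addr = listener.accept()
    received = bytearray()
    try:
        while True:
            chunk = conn.recv(1024)
            if not chunk:
                # Empty read: the peer has sent its FIN. The OS now keeps this
                # socket in CLOSE_WAIT until the program closes it.
                result["saw_eof"] = True
                break
            received.extend(chunk)
    finally:
        # Closing is the program's job; skipping it is what leaks CLOSE_WAIT sockets.
        conn.close()
        result["data"] = bytes(received)


def run_demo():
    result = {"saw_eof": False, "data": b""}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        worker = threading.Thread(target=handle_one_client, args=(listener, result))
        worker.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(b"ping")
        # Leaving the "with" block closes the client side, which sends the FIN.
        worker.join(timeout=5)
    return result
```

If the conn.close() call were left out and the object kept alive, the server-side socket would stay in CLOSE_WAIT for as long as the process runs, which matches the symptom in the question.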
There are kernel-level properties for reusing connections and shortening the TIME_WAIT period. Note that they act on TIME_WAIT rather than CLOSE_WAIT.
This tutorial describes them: http://www.linuxbrigade.com/reduce-time_wait-socket-connections/
They may help if TIME_WAIT sockets are also piling up.

Zookeeper status - telnet connections: 4

Could someone help me understand whether zookeeper is required to have 4 connections?
My requirement is simple - I want to run Apache Kafka with Spark on my local machine. As per the Kafka documentation, I started the zookeeper under the kafka bin directory and wanted to confirm that my zookeeper is up.
So I tried "telnet localhost 2181" from the command prompt
and got the output below:
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
stats
Zookeeper version: 3.4.5--1, built on 06/10/2013 17:26 GMT
Clients:
/127.0.0.1:34231[1](queued=0,recved=436,sent=436)
/127.0.0.1:34230[1](queued=0,recved=436,sent=436)
/127.0.0.1:37719[0](queued=0,recved=1,sent=0)
/127.0.0.1:34232[1](queued=0,recved=436,sent=436)
Latency min/avg/max: 0/0/42
Received: 2127
Sent: 2136
Connections: 4
Outstanding: 0
Zxid: 0x143
Mode: standalone
Node count: 51
Connection closed by foreign host.
I would like to know why it reports 4 connections with 4 clients. What does that actually mean?
Thank you in advance for helping me understand whether 4 clients are required.
I would like to know why it reports 4 connections with 4 clients. What does that actually mean?
It means there are currently four connections open to zookeeper. This connection:
/127.0.0.1:37719[0](queued=0,recved=1,sent=0)
is your telnet localhost 2181 connection.
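To read the client list without counting by eye, the "Clients:" lines can be parsed. This is a rough sketch that assumes the line format shown in the output above; the function names are made up, and treating "recved=1, sent=0" as the mark of the probing connection is only a heuristic based on this answer.

```python
import re

# Matches lines such as: /127.0.0.1:37719[0](queued=0,recved=1,sent=0)
CLIENT_RE = re.compile(
    r"^\s*/(?P<host>.+):(?P<port>\d+)\[\d+\]"
    r"\(queued=(?P<queued>\d+),recved=(?P<recved>\d+),sent=(?P<sent>\d+)\)\s*$"
)


def parse_clients(stat_text):
    """Return one dict per client line found in the status output."""
    clients = []
    for line in stat_text.splitlines():
        match = CLIENT_RE.match(line)
        if match:
            clients.append({
                "host": match.group("host"),
                "port": int(match.group("port")),
                "queued": int(match.group("queued")),
                "recved": int(match.group("recved")),
                "sent": int(match.group("sent")),
            })
    return clients


def probable_probe_connections(clients):
    """Heuristic: a connection that has received one command and sent nothing yet."""
    return [c for c in clients if c["recved"] == 1 and c["sent"] == 0]
```

Applied to the output in the question, this yields four clients, and the only probable probe is the one on port 37719.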

How to get zxid of a zookeeper server?

Zookeeper assigns a unique number to each transaction, called the zxid. It has two parts - an epoch and a counter. I could find the epoch value in zookeeper's data directory. However, I can't find the counter. Does anyone know where I can find it?
In general, how do I get the zxid of a zookeeper server?
Turns out it's pretty easy:
echo srvr | nc localhost 2181
The zxid also appears in the current status of the zookeeper server, as answered in another post. First I ran telnet:
telnet localhost 2181
Then I sent the following data to the server:
stats
and received the following information:
Zookeeper version: 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 00:39 GMT
Clients:
/127.0.0.1:54864[1](queued=0,recved=6030,sent=6033)
/192.168.80.1:55675[0](queued=0,recved=1,sent=0)
/192.168.80.1:54769[1](queued=0,recved=432,sent=432)
Latency min/avg/max: 0/0/35
Received: 7104
Sent: 7114
Connections: 3
Outstanding: 0
Zxid: 0xd0
Mode: standalone
Node count: 148
Connection to host lost.
As you can see, the zxid is currently 0xd0 on my zookeeper server.
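Since the question asks about the two parts: the zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are the counter, so both can be read off the reported value. A minimal sketch (the helper name split_zxid is made up):

```python
def split_zxid(zxid):
    """Split a zxid (int, or hex string such as "0xd0") into (epoch, counter)."""
    value = int(zxid, 16) if isinstance(zxid, str) else zxid
    epoch = value >> 32            # high 32 bits
    counter = value & 0xFFFFFFFF   # low 32 bits
    return epoch, counter
```

For the value 0xd0 above this gives epoch 0 and counter 208.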