Too many connections on zookeper server - apache-kafka

Environment: HDP 2.6.4
Ambari – 2.6.1
3 zookeeper server
23.1.35.185 - is the IP of the first zookeeper server
hi all,
In the first zookeeper server it seems that even after closing the connection to zookeeper is not getting closed,
which causes the maximum number of client connections to be reached from a host - we have maxClientCnxns as 60 in zookeeper config
As a result when a new application comes and tries to create a connection it fails.
Example when Connections are:
echo stat | nc 23.1.35.185 2181
Latency min/avg/max: 0/71/399
Received: 3031 Sent: 2407
Connections: 67
Outstanding: 622
Zxid: 0x130000004d
Mode: follower
Node count: 3730
But after some time when connection comes to ~70 we see
echo stat | nc 23.1.35.185 2181
Ncat: Connection reset by peer.
And We can see also many CLOSE_WAIT
java 58936 zookeeper 60u IPv6 381963738 0t0 TCP Zookeper_server.sys54.com:eforward->zookeper_server.sys54.com:44983 (CLOSE_WAIT)
From the zookeeper log
2018-12-26 02:50:46,382 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#193]
- Too many connections from /23.1.35.185 - max is 60
In the ambari we can see also
Connection failed: [Errno 104] Connection reset by peer to zookeper_server.sys54.com.:2181
I must to say that this not happening on zookeeper servers 2 and 3
NOTE - if we increase the maxClientCnxns to 300 , its not help because after some time we get more the 300 connections ( CLOSE_WAIT ) and then we see from the log
2018-12-26 02:50:49,375 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#193] - Too many connections from /23.1.35.187 - max is 300
so any hint why the connection are CLOSE_WAIT ?

CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.
The problem is your program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (and quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open.
Once the local program closes the socket, the OS can send the FIN to the remote end which transitions you to LAST_ACK while you wait for the ACK of the FIN. Once that is received, the connection is finished and drops from the connection table (if your end is in CLOSE_WAIT you do not end up in the TIME_WAIT state).
There is a kernel level property to reuse the connection and reduce the CLOSE_WAIT time.
I suggest you to follow this tutorial http://www.linuxbrigade.com/reduce-time_wait-socket-connections/
This should probably solve your problem.

Related

Can syslog pri value can be negative?

First i will tell you my architecture
client--->haproxy--->syslog-ng--->kafka
the client is Cisco ASA and haproxy is server for load-balancing and syslog-ng is for receiving ,filtering and sending logs to kafka(destination)
The client sends logs to haproxy and haproxy send logs to syslog-ng using tcp transport
As in tcp the client-server timeout breaks whenever client restored the connection its PRI value is negative which we seeing in wireshark.With this issue the messages gets mixup
Connection restored is normal but PRI value is negative this is incorrect.
I am showing you the the logs
<-1>May 24 2021 17:40:28: %ASA--1-6414004: TCP Syslog Server private:xx.xx.xx.xx/1470 -
Connection restored\\nCAL\\\\John Mike/xxxxxxxxxxxxxxxxxx) to private:xx.xx.xx.xx/xx duration 0:00:00 bytes 142
(John Mike/xxxxxxxxxxxxxxxxxx)\\nxxxxxxx)\\n4 2021 17:40:28: %ASA-6-302016: Teardown UDP connection 1733810491
we've increase the client connection timeout from 1min to 12 hr but the problem is not resolved
Some version of the Cisco ASA TCP Syslog code are affected by bug CSCvz85683:
Symptom:
Wrong syslog message format, ex for 414004:
-1>Sep 08 2021 10:46:25: %ASA--1-6414004: TCP Syslog Server private:xx.xx.xx.xx/1470 - Connection restored\n (xx.xx.xx.xx/64437)
Conditions:
External logging to TCP server is enabled
Workaround:
NA
Further Problem Description:
ASA syslog messages have 6-digit ID
The valid range for message IDs is between 100000 and 999999.
Source: Cisco ASA Series Syslog Messages. About ASA Syslog Messages.
When logging via TCP on versions with the defect code, will shift the priority (6 in this case) into the message code (414004 in this case) and use an illegal priority -1.
According to the bug, this has been fixed in version 9.14.4.

Netstat output with boost::Asio

I have created an asio server with acceptor:
m_acceptor(m_ios, asio::ip::tcp::endpoint(asio::ip::address_v4::any(), port_num)
where port number is 3333
At this point, the netstat -antup command shows :
13:tcp 0 0 0.0.0.0:3333 0.0.0.0:* LISTEN 26566/./test
So, I believe this means that local address 0 0.0.0.0:3333 is ready to listen to any connection on port 3333
After this, I start the client which creates the endpoint to ip : 127.0.0.1 and port 3333
After this, the netstat output is:
tcp 0 0 0.0.0.0:3333 0.0.0.0:* LISTEN 26566/./test
tcp 0 0 127.0.0.1:3333 127.0.0.1:46675 ESTABLISHED 26566/./test
tcp 0 0 127.0.0.1:46675 127.0.0.1:3333 ESTABLISHED 26685/./test
Process 26566 is master process
Process 26685 is slave process
What I do not understand is what does the the port 46675 mean in the address shown above? This definitely represents the client side, but from where was this port number allocated to the client?
Does this mean that client has connected to port 3333 but the port from which it itself connects is 46675?
Does this mean that client has connected to port 3333 but the port from which it itself connects is 46675?
Basically. It describes the client endpoint. This is BSD/Posix sockets jargon.
What I do not understand is what does the the port 46675 mean in the address shown above? This definitely represents the client side, but from where was this port number allocated to the client?
It gets automatically chosen (by the TCP stack, usually in the kernel) from the local port range. E.g. on linux you can manipulate that range (if you have permission):
sudo sysctl -w net.ipv4.ip_local_port_range="60000 61000"
(Warning: don't do this unless you know what you're doing). See also https://en.wikipedia.org/wiki/Ephemeral_port

Zookeeper status - telnet connections: 4

Could someone help me to understand is it required to have 4 connections for zookeeper.
My requirement is simple - I want to run a apache kafka with spark in my local machine. As per the kafka documentation I had started the zookeeper under the kafka bin and wanted to confirm if my zookeeper is up.
So, tried "telnet localhost 2181" from the command prompt.
And got the below ouput:
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
stats
Zookeeper version: 3.4.5--1, built on 06/10/2013 17:26 GMT
Clients:
/127.0.0.1:34231[1](queued=0,recved=436,sent=436)
/127.0.0.1:34230[1](queued=0,recved=436,sent=436)
/127.0.0.1:37719[0](queued=0,recved=1,sent=0)
/127.0.0.1:34232[1](queued=0,recved=436,sent=436)
Latency min/avg/max: 0/0/42
Received: 2127
Sent: 2136
Connections: 4
Outstanding: 0
Zxid: 0x143
Mode: standalone
Node count: 51
Connection closed by foreign host.
I would like to know as why the connection is saying 4 with 4 clients. what does that actually mean?
Thank you in advance to help me understand if 4 clients are required.
I would like to know as why the connection is saying 4 with 4 clients. what does that actually mean?
It means there are currently four connections open to zookeeper. This connection:
/127.0.0.1:37719[0](queued=0,recved=1,sent=0)
is your telnet localhost 2181 connection.

What happens when TCP keepalive probe failed on an established socket?

To be more clear, I am wondering what TCP would a socket be in when keepalive probe failed too many times?
For example right now I have the following for netstat -anop
tcp 0 0 10.10.10.10:12345 11.11.11.11:56789 ESTABLISHED 12345/process keepalive (7200.00/0/0)
Let's say host 11.11.11.11 suddenly lost power forever, and host 10.10.10.10's keepalive probe will eventually detect the broken connection. When 10.10.10.10 detects the broken connection, what will the socket's state be in as shown by netstat?
The connection will be reset and the line will disappear from netstat.

How to get zxid of a zookeeper server?

Zookeeper assigns a unique number for each transaction called zxid. It has two parts - an epoch and a counter. I could find the epoch value in zookeeper's data directory. However I cant find the counter. Does anyone know where I can find it?
In general, how to get zxid for zookeeper?
Turns out its pretty easy
echo srvr | nc localhost 2181
Also looking at the current status of zookeeper server can show the zxid which is answered in another post. Firt i executed telnet:
telnet localhost 2181
Then send following data to server:
stats
and then received following information:
Zookeeper version: 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 00:39 GMT
Clients:
/127.0.0.1:54864[1](queued=0,recved=6030,sent=6033)
/192.168.80.1:55675[0](queued=0,recved=1,sent=0)
/192.168.80.1:54769[1](queued=0,recved=432,sent=432)
Latency min/avg/max: 0/0/35
Received: 7104
Sent: 7114
Connections: 3
Outstanding: 0
Zxid: 0xd0
Mode: standalone
Node count: 148
Connection to host lost.
As you see the zxid is currently 0xd0 in my zookeeper server.