Reusing TIME_WAIT socket and the ephemeral port limit - sockets

There is a web Service behind Nginx in a reverse proxy mode.
The Service and Nginx are on the same Linux host. Clients are on a separate host.
Nginx is configured to start a new connection on each HTTP request to the service and it sends the "Connection: close" HTTP header so the receiver closes TCP connection after returning the response to Nginx.
Because the Service always actively closes TCP connections the corresponding sockets are left in TIME_WAIT state.
Between the client and Nginx there is a fixed number of TCP connections because Nginx is configured to use 'keepalive' for incoming requests.
Under a heavy load clients produce a lot of HTTP requests and the number of TIME_WAIT sockets increases. One might expect that soon the system runs out of ephemeral ports, but it does not.
When the total number of sockets used between Nginx and the Service (in any states: TIME_WAIT, ESTABLISHED and so on) reaches exactly the half of a maximum number of ephemeral ports - 14115 (it is (60999-32768)/2) - the system starts reusing TIME_WAIT sockets without waiting for 60s timeout.
In details.
The Service processes incoming requests in ~21ms, so when there is only one client sending request one after another there will be ~2900 (60s/21ms = 2857) sockets in TIME_WAIT state after 60s, then they start transferring to the 'CLOSED' state and can be reused. The output of ss command confirms this.
When there two clients the number of TIME_WAITS is ~5800 and so on. But when the number of TIME_WAITs reaches ~14000 it stops increasing. "ss -o" shows that TIME_WAIT sockets are reused before TIME_WAIT timer expiration (60s).
The question is where does this limit of 14000 come from?

Related

Understanding keepalive between client and cockroachdb with haproxy

We are facing a problem where our client lets name it A. Is attempting to connect DB server (Cockroach) name B load balanced via ha-proxy
A < -- > haproxy < -- > B
Now at every, while our client A is receiving Broken Pipe error.
But I'm not able to understand why?
Cockroach server already has the below default value i.e 60 seconds.
COCKROACH_SQL_TCP_KEEP_ALIVE ## which is enabled to send for 60 second
Plus our haproxy config has the following setting.
defaults
mode tcp
# Timeout values should be configured for your specific use.
# See: https://cbonte.github.io/haproxy-dconv/1.8/configuration.html#4-timeout%20connect
timeout connect 10s
timeout client 1m
timeout server 1m
# TCP keep-alive on client side. Server already enables them.
option clitcpka
option clitcpka
So what is causing the TCP connection to drop when the keepalive is enabled on every end.
Keepalive is what makes connections go away if one of the end points has died without closing the connection. Investigate in that direction.
The only time keepalive actually keeps the connection alive is in connection with an ill-configured firewall that drops idle connections.

Are application level Retransmission and Acknowledgement needed over TCP?

I have the following queries:
1) Does TCP guarantee delivery of packets and thus is thus application level re-transmission ever required if transport protocol used is TCP. Lets say I have established a TCP connection between a client and server, and server sends a message to the client. However the client goes offline and comes back only after say 10 hours, so will TCP stack handle re-transmission and delivering message to the client or will the application running on the server need to handle it?
2) Related to the above question, is application level ACK needed if transport protocol is TCP. One reason for application ACK would be that without it, the application would not know when the remote end received the message. Is there any reason other than that? Meaning is the delivery of the message itself guaranteed?
Does TCP guarantee delivery of packets and thus is thus application level re-transmission ever required if transport protocol used is TCP
TCP guarantees delivery of message stream bytes to the TCP layer on the other end of the TCP connection. So an application shouldn't have to bother with the nuances of retransmission. However, read the rest of my answer before taking that as an absolute.
However the client goes offline and comes back only after say 10 hours, so will TCP stack handle re-transmission and delivering message to the client or will the application running on the server need to handle it?
No, not really. Even though TCP has some degree of retry logic for individual TCP packets, it can not perform reconnections if the remote endpoint is disconnected. In other words, it will eventually "time out" waiting to get a TCP ACK from the remote side and do a few retries. But will eventually give up and notify the application through the socket interface that the remote endpoint connection is in a dead or closed state. Typical pattern is that when a client application detects that it lost the socket connection to the server, it either reports an error to the user interface of the application or retries the connection. Either way, it's application level decision on how to handle a failed TCP connection.
is application level ACK needed if transport protocol is TCP
Yes, absolutely. Most client-server protocols has some notion of a request/response pair of messages. A TCP socket can only indicate to the application if data "sent" by the application is successfully queued to the kernel's network stack. It provides no guarantees that the application on top of the socket on the remote end actually "got it" or "processed it". Your protocol on top of TCP should provide some sort of response indication when ever a message is processed. Use HTTP as a good example here. Imagine if an application would send an HTTP POST message to the server, but there was not acknowledgement (e.g. 200 OK) from the server. How would the client know the server processed it?
In a world of Network Address Translators (NATs) and proxy servers, TCP connections that are idle (no data between each other) can fail as the NAT or proxy closes the connection on behalf of the actual endpoint because it perceives a lack of data being sent. The solution is to have some sort of periodic "ping" and "pong" protocol by which the applications can keep the TCP connection alive in the absences of having no data to send.

How does a TCP server handle multiple Sockets to listening to the same port? [duplicate]

This might be a very basic question but it confuses me.
Can two different connected sockets share a port? I'm writing an application server that should be able to handle more than 100k concurrent connections, and we know that the number of ports available on a system is around 60k (16bit). A connected socket is assigned to a new (dedicated) port, so it means that the number of concurrent connections is limited by the number of ports, unless multiple sockets can share the same port. So the question.
TCP / HTTP Listening On Ports: How Can Many Users Share the Same Port
So, what happens when a server listen for incoming connections on a TCP port? For example, let's say you have a web-server on port 80. Let's assume that your computer has the public IP address of 24.14.181.229 and the person that tries to connect to you has IP address 10.1.2.3. This person can connect to you by opening a TCP socket to 24.14.181.229:80. Simple enough.
Intuitively (and wrongly), most people assume that it looks something like this:
Local Computer | Remote Computer
--------------------------------
<local_ip>:80 | <foreign_ip>:80
^^ not actually what happens, but this is the conceptual model a lot of people have in mind.
This is intuitive, because from the standpoint of the client, he has an IP address, and connects to a server at IP:PORT. Since the client connects to port 80, then his port must be 80 too? This is a sensible thing to think, but actually not what happens. If that were to be correct, we could only serve one user per foreign IP address. Once a remote computer connects, then he would hog the port 80 to port 80 connection, and no one else could connect.
Three things must be understood:
1.) On a server, a process is listening on a port. Once it gets a connection, it hands it off to another thread. The communication never hogs the listening port.
2.) Connections are uniquely identified by the OS by the following 5-tuple: (local-IP, local-port, remote-IP, remote-port, protocol). If any element in the tuple is different, then this is a completely independent connection.
3.) When a client connects to a server, it picks a random, unused high-order source port. This way, a single client can have up to ~64k connections to the server for the same destination port.
So, this is really what gets created when a client connects to a server:
Local Computer | Remote Computer | Role
-----------------------------------------------------------
0.0.0.0:80 | <none> | LISTENING
127.0.0.1:80 | 10.1.2.3:<random_port> | ESTABLISHED
Looking at What Actually Happens
First, let's use netstat to see what is happening on this computer. We will use port 500 instead of 80 (because a whole bunch of stuff is happening on port 80 as it is a common port, but functionally it does not make a difference).
netstat -atnp | grep -i ":500 "
As expected, the output is blank. Now let's start a web server:
sudo python3 -m http.server 500
Now, here is the output of running netstat again:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:500 0.0.0.0:* LISTEN -
So now there is one process that is actively listening (State: LISTEN) on port 500. The local address is 0.0.0.0, which is code for "listening for all ip addresses". An easy mistake to make is to only listen on port 127.0.0.1, which will only accept connections from the current computer. So this is not a connection, this just means that a process requested to bind() to port IP, and that process is responsible for handling all connections to that port. This hints to the limitation that there can only be one process per computer listening on a port (there are ways to get around that using multiplexing, but this is a much more complicated topic). If a web-server is listening on port 80, it cannot share that port with other web-servers.
So now, let's connect a user to our machine:
quicknet -m tcp -t localhost:500 -p Test payload.
This is a simple script (https://github.com/grokit/quickweb) that opens a TCP socket, sends the payload ("Test payload." in this case), waits a few seconds and disconnects. Doing netstat again while this is happening displays the following:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:500 0.0.0.0:* LISTEN -
tcp 0 0 192.168.1.10:500 192.168.1.13:54240 ESTABLISHED -
If you connect with another client and do netstat again, you will see the following:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:500 0.0.0.0:* LISTEN -
tcp 0 0 192.168.1.10:500 192.168.1.13:26813 ESTABLISHED -
... that is, the client used another random port for the connection. So there is never confusion between the IP addresses.
A server socket listens on a single port. All established client connections on that server are associated with that same listening port on the server side of the connection. An established connection is uniquely identified by the combination of client-side and server-side IP/Port pairs. Multiple connections on the same server can share the same server-side IP/Port pair as long as they are associated with different client-side IP/Port pairs, and the server would be able to handle as many clients as available system resources allow it to.
On the client-side, it is common practice for new outbound connections to use a random client-side port, in which case it is possible to run out of available ports if you make a lot of connections in a short amount of time.
A connected socket is assigned to a new (dedicated) port
That's a common intuition, but it's incorrect. A connected socket is not assigned to a new/dedicated port. The only actual constraint that the TCP stack must satisfy is that the tuple of (local_address, local_port, remote_address, remote_port) must be unique for each socket connection. Thus the server can have many TCP sockets using the same local port, as long as each of the sockets on the port is connected to a different remote location.
See the "Socket Pair" paragraph in the book "UNIX Network Programming: The sockets networking API" by
W. Richard Stevens, Bill Fenner, Andrew M. Rudoff at: http://books.google.com/books?id=ptSC4LpwGA0C&lpg=PA52&dq=socket%20pair%20tuple&pg=PA52#v=onepage&q=socket%20pair%20tuple&f=false
Theoretically, yes. Practice, not. Most kernels (incl. linux) doesn't allow you a second bind() to an already allocated port. It weren't a really big patch to make this allowed.
Conceptionally, we should differentiate between socket and port. Sockets are bidirectional communication endpoints, i.e. "things" where we can send and receive bytes. It is a conceptional thing, there is no such field in a packet header named "socket".
Port is an identifier which is capable to identify a socket. In case of the TCP, a port is a 16 bit integer, but there are other protocols as well (for example, on unix sockets, a "port" is essentially a string).
The main problem is the following: if an incoming packet arrives, the kernel can identify its socket by its destination port number. It is a most common way, but it is not the only possibility:
Sockets can be identified by the destination IP of the incoming packets. This is the case, for example, if we have a server using two IPs simultanously. Then we can run, for example, different webservers on the same ports, but on the different IPs.
Sockets can be identified by their source port and ip as well. This is the case in many load balancing configurations.
Because you are working on an application server, it will be able to do that.
I guess none of the answers tells every detail of the process, so here it goes:
Consider an HTTP server:
It asks the OS to bind the port 80 to one or many IP addresses (if you choose 127.0.0.1, only local connections are accepted. You can choose 0.0.0.0 to bind to all IP addresses (localhost, local network, wide area network, both IP versions)).
When a client connects to that port, it WILL lock it up for a while (that's why the socket has a backlog: it queues a number of connection attempts, because they ARE NOT instantaneous).
The OS then chooses a random port and transfer that connection to that port (think of it as a temporary port that will handle all the traffic from now on).
The port 80 is then released for the next connection (first, it will accept the first one in the backlog).
When client or server disconnects, the random port is held open for a while (CLOSE_WAIT in the remote side, TIME_WAIT in the local side). That allows flushing some lost packets along the path. The default time for that state is 2 * MSL seconds (and it WILL consume memory while is waiting).
After that waiting, that random port is free again to receive other connections.
So, TCP cannot even share a port amongst two IP's!
No. It is not possible to share the same port at a particular instant. But you can make your application such a way that it will make the port access at different instant.
Absolutely not, because even multiple connections may shave same ports but they'll have different IP addresses

Does haproxy buffer tcp request body when backend is down?

I am using haproxy 1.6.4 as TCP(not HTTP) proxy.
My clients are making TCP requests. They do not wait for any response, they just send the data and close the connection.
How haproxy behaves when all back-end nodes are down?
I see that (from the client point of view) haproxy is accepting incomming connections.
Haproxy statistics show that front-end has status OPEN, he is accepting connections.
Number of sessions and bytes-in increases for frontend, but not for back-end (he is DOWN).
Is haproxy buffering incoming TCP requests, and will pass them to the back-end once back-end is up?
If yes, it is possible to configure this buffer size? Where data is buffered (in memory, disk?)
Is this possible to turn off front-end (do not accept incoming TCP connections) when all back-end nodes are DOWN?
Edit:
when backend started, I see that
* backend in-bytes and sessions is equal to front-end number of sessions
* but my one and only back-end node has fever number of bytes-in, fever sessions and has errors.
So, it seems that in default configuration there is no tcp buffering.
Data is accepted by haproxy even if all backend nodes are down, but this data is lost.
I would prefer to turn off tcp front-end when there are no backend servers- so client connections would be rejected. Is that possible?
edit:
haproxy log is
Jul 15 10:02:32 172.17.0.2 haproxy[1]: 185.130.180.3:11319
[15/Jul/2016:10:02:32.335] tcp-in app/ -1/-1/0 0 SC \0/0/0/0/0
0/0 908
my log format is
%ci:%cp\ [%t]\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %B\ %ts\ \%ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %U
What I understand from log:
there are no backeend servers
termination state SC translates to
S : the TCP session was unexpectedly aborted by the server, or the
server explicitly refused it.
C : the proxy was waiting for the CONNECTION to establish on the
server. The server might at most have noticed a connection attempt.
I don't think what you are looking for is possible. HAproxy handles the two sides of the connection (frontend, backend) separately. The incoming TCP connection is established first and then HAproxy looks for a matching destination for it.

AWS TCP ELB refuse connection when there is no available back-end server

We have a TCP application that receives connections in a protocol that we did not design and don’t control.
This protocol will assume that if it can establish a TCP connection, then it can send a message and that message is acknowledged.
This works ok if connecting directly to a machine, if the machine or application is down, the tcp connection will be refused or dropped and the client will attempt to redeliver the message.
When we use AWS elastic load balancer, ELB will establish a TCP connection with the client, regardless of whether there is an available back-end server to fulfil the request.
As a result if our application or server crashes then we lose messages.
ELB will close the TCP connection shortly thereafter, but its not good enough.
Is there a way to make ELB, only establish a connection if it can reach the back-end server?
What options do we have (within the AWS ecosystem), of balancing a TCP based service, while still refusing connections if they cannot be served.
I don't think that's achievable through ELB. By design a load balancer will manage 2 sets of connections (frontend - LB and LB - backend). The load balancer will attempt to minimize the time it takes to serve the traffic it receives. This means that the FE-LB connection will be established as the LB looks for a Backend connection to use / reuse. The case in which all of the Backend hosts are dead is such an edge case that you end up with the behavior you are seeing. Normally it's not a big deal as the requested will just get disconnected once the LB figures out that it cannot server the traffic.
Back to your protocol: to me it seem really weird that you would interpret the ability to establish a connection as equal to message delivery. It sounds like you're using TCP but not waiting for the confirmations that the message were actually received at the destination. To me that seems wrong and will get you in trouble eventually with or without a load balancer.
And not to sound too pessimistic (I do understand we are not living in an ideal world) what I would do in this specific scenario, if you can deploy additional software on the client, would be to use a tcp proxy on the client that would get disabled automatically whenever the load balancer is unhealthy/unable to serve traffic. Instruct the client to connect to this proxy. Far from ideal but it should do the trick.
You could create a health check from your ELB to verify if the backend EC2 instances respond on the TCP port. See ELB Health Checks
Then, you monitor the health status of the EC2 instances sent by the ELB to CloudWatch.
Once you determine that none of the EC2 instances are responding on the TCP port, you can remove the TCP listener from the ELB. See Delete ELB Listeners
Hopefully, at that point the ELB stops accepting TCP connections.
Note, I have not tested this solution.