What could "reason: Layer6 timeout" possibly mean? - haproxy

I have a haproxy configured with two servers in the backend. Occasionally, every 16-20h one of them gets marked by haproxy as DOWN:
haproxy.log-20190731:2019-07-30T16:16:24+00:00 <local2.alert> haproxy[2716]: Server be_kibana_elastic/kibana8 is DOWN, reason: Layer6 timeout, check duration: 2000ms. 0 active and 0 backup servers left. 8 sessions active, 0 requeued, 0 remaining in queue.
I did some reading on how haproxy runs the checks, but "Layer6 timeout" does not tell me much. What could be possible reasons for that timeout? What does it actually mean?
Here is my backend configuration:
backend be_kibana_elastic
balance roundrobin
stick on src
stick-table type ip size 100k expire 12h
server kibana8 172.24.0.1:5601 check ssl verify none
server kibana9 172.24.0.2:5601 check ssl verify none

Layer 6 refers to TLS. The backend is accepting a TCP connection but isn't negotiating TLS (SSL) on the health check connection within the allowed time.
The configuration values timeout connect, timeout check, and inter all interact to determine how much time health checks are allowed to complete. The default value of inter, if not specified, is 2000 milliseconds, which is what you're seeing here. By default, inter (the health check interval) determines both how often checks run and how long they are allowed to take.
Since you have not configured a fall count for the servers, the default value of 3 is being used, which means your server is actually failing 3 consecutive health checks before being marked down.
Consider adding option log-health-checks to the backend declaration, which will create additional log entries for those initial failing checks before the final one causes the server to be marked down.
Increasing the allowable time may avoid the failure, but is probably valid only for testing -- not a fix -- because if your backend can't reliably respond to a check within 2000 ms, then it also can't reliably respond to client connections within that time frame, which is a long time to wait for a response.
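For reference, here is a sketch of how the backend from the question might look with individual check logging enabled and the check timing made explicit; the inter, fall, rise, and timeout values are illustrative only, not recommendations:
backend be_kibana_elastic
balance roundrobin
option log-health-checks                  # log each individual failing check, not just the final state change
timeout connect 5s                        # time allowed to establish the TCP connection (applies to checks as well)
timeout check 5s                          # extra time allowed after connect for the check itself (here, the TLS handshake)
stick on src
stick-table type ip size 100k expire 12h
server kibana8 172.24.0.1:5601 check ssl verify none inter 2s fall 3 rise 2
server kibana9 172.24.0.2:5601 check ssl verify none inter 2s fall 3 rise 2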
Note that intermittent packet loss will typically cause sluggish behavior in increments of 3000 ms, because TCP stacks often use a retransmission timeout (RTO) of 3 seconds. Since this is more than 2000 ms, packet loss on your network is one possible explanation for the problem.
Another possible explanation is excessive CPU load on the backend, either related to traffic or to a cron job doing something intensive, because TLS negotiation -- relatively speaking -- is an expensive process from the CPU's perspective.

Related

Maximum number of requests per TCP connection

I am acting as a server which receives multiple requests from a client over a socket and handles them in a thread.
Should I set any parameter at the TCP level to set the maximum number of requests a connection can handle simultaneously?
Because on my server side, if processing a request is slow, I observe that other requests are queued up (the client says the request has been sent, but I receive it late).
Kindly guide me.
If it takes a long time to do the work and you want to handle multiple connections simultaneously, you have to change how you do things.
If you are actively using a lot of CPU during processing a long request, you'll need multiple threads. That's the only way to actually get more CPU time / second -- assuming you have multiple cores available.
If you are waiting on things like file IO, then you can instead use asynchronous processing to handle the requests on a single thread, but just handle a little piece at a time.
Setting a maximum number of TCP connections won't help you handle requests more quickly. It will just reject connections and not even allow a first-come, first-served type of behavior; it will simply be random whether a specific client ever gets through or not.
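As a concrete illustration of the thread-per-connection approach described above, here is a minimal sketch in C using POSIX threads; the port number and the echo-style handler are placeholders for whatever work your server actually does:
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Handle one client connection on its own thread. The "work" here is just an
   echo; a slow request on this connection no longer delays other connections. */
static void *handle_client(void *arg) {
    int fd = (int)(intptr_t)arg;
    char buf[1024];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) {
        /* ... real (possibly slow) request processing would go here ... */
        send(fd, buf, (size_t)n, 0);
    }
    close(fd);
    return NULL;
}

int main(void) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                      /* example port */
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 16) < 0) {
        perror("bind/listen");
        return 1;
    }
    for (;;) {
        int client = accept(listener, NULL, NULL);
        if (client < 0)
            continue;
        pthread_t tid;
        pthread_create(&tid, NULL, handle_client, (void *)(intptr_t)client);
        pthread_detach(tid);                          /* no join; thread cleans up on exit */
    }
}
Build with the pthread library (e.g. cc server.c -pthread); for I/O-bound work, a single-threaded asynchronous design with non-blocking sockets would be the alternative mentioned above.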

SAPUI5 request timeout to the Gateway

I have an OData request in my SAPUI5 application which calls the Gateway.
On the Gateway, I have a trusted RFC connection to the backend.
Now I have a complex algorithm with a duration of around 2 minutes.
After 60 seconds, I get a timeout error:
HTTP request failed500,Internal Server Error,500 Connection timed out
Is there a way to increase the timeout?
I tried it with the parameters gw/reg_timeout and gw/conn_pending, and with the keep-alive timeout of the RFC connection.
None of these options has solved my problem.
I guess you already tried everything from SAP Help.
Maybe this is some ICM/Web Dispatcher timeout; check the link and try some of the settings, e.g. PROCTIMEOUT. Also consider the hints there:
Recommendation
In systems where the standard timeout setting of 60 seconds for the keep-alive and processing timeouts is not sufficient due to long-running applications, SAP recommends that both the TIMEOUT and PROCTIMEOUT parameters are set for the services concerned so that they can be configured independently of each other. The TIMEOUT value should not be set unnecessarily high. We recommend you set this parameter as follows:
icm/server_port_0 = PROT=HTTP,PORT=1080,TIMEOUT=60,PROCTIMEOUT=600
in order to allow a maximum processing time of 10 minutes.

What is the overhead traffic of a TCP connection (plus TCP clarifications)?

We have a TCP connection.
Nothing is sent over it; how much traffic (in bytes) is needed each second to keep that connection open?
What is the duration of opening a connection from a client in South America to a server in North Europe?
If I have to send a small amount of data (max 256 bytes) at an interval of x seconds, what would x be for which it is better to close the connection and reopen it instead of keeping the connection always open?
I do not expect exact data - estimates will suffice.
1) none.
2) some time. Try it and see. For a rough estimate, ping one end from the other and double it.
3) try it. It depends on bandwidth and, more importantly, latency. These vary over wide ranges. Usually, it's better, speed-wise, to keep connections open. 256 bytes at intervals of seconds? I would keep the connection open, especially over paths with possibly high latency (e.g. intercontinental).
1. According to the TCP/IP standard, nothing. However, depending on the network conditions and any middleboxes (NAT devices, firewalls, etc.), a connection with no data going over it may be dropped. That could be a static timeout (say two minutes, or ten minutes, or an hour), or it could be based on a least-recently-used table in some device.
2. It depends on a lot of factors, and the biggest delay may be from the client's local network rather than the intercontinental connection. However, the surface of the earth between the points is about 40 light-milliseconds, so (without TCP Fast Open) that would be 120 ms for the first data packet to get from the client to the server and 40 ms for the response, 80 ms more than on an already-established connection.
3. Assuming no broken middleboxes, it is always better to keep the connection open. However, the delay to recover from a "silently dropped" connection may be a lot longer than the time to open a new one; it might be appropriate for the client to manage its own timeout (on the order of a second or so) and open a new connection and retry the last message if it hasn't gotten a response by then. It depends on what you're sending; transactional messages might merit such explicit fast retry more than a remote copy of syslog.
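To illustrate the client-side timeout-and-retry idea from point 3, here is a rough sketch in C; connect_to_server() is a hypothetical helper standing in for however you open the connection, and the 1-second timeout is only an example:
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int connect_to_server(void);   /* hypothetical helper: returns a connected TCP socket, or -1 */

/* Send msg on *fdp and wait up to ~1 second for a reply. If no reply arrives,
   assume the old connection was silently dropped: open a fresh connection,
   retry the message once, and hand the new socket back to the caller. */
ssize_t send_with_retry(int *fdp, const char *msg, size_t len, char *reply, size_t cap) {
    struct timeval tv = { 1, 0 };                     /* ~1 s application-level timeout */
    for (int attempt = 0; attempt < 2; attempt++) {
        setsockopt(*fdp, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
        if (send(*fdp, msg, len, 0) == (ssize_t)len) {
            ssize_t n = recv(*fdp, reply, cap, 0);
            if (n > 0)
                return n;                             /* got a response in time */
        }
        close(*fdp);                                  /* no response: reconnect and retry once */
        *fdp = connect_to_server();
        if (*fdp < 0)
            return -1;
    }
    return -1;
}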

VxWorks 6.3 active sockets max out at 255?

I have an LPD server running on VxWorks 6.3. The client application (over which I have no control) is sending me an LPQ query every tenth of a second. After 235 requests, the client receives an RST when trying to connect. After a time, the device will again accept some queries (about 300) until it again starts sending out RSTs.
I have confirmed that it is the TCP stack that is causing the RST. There are some things that I have noticed.
1) I can somewhat change the number of sockets that will be accepted if I change the number of other applications that are running. For example, I freed up 4 sockets, thereby changing the number accepted from 235 to 239.
2) If I send requests to lpr (port 515) and another port (say, port 80), the total number of connections that are accepted before the RSTs start happening stays constant at 235.
3) There are lots of sockets sitting in TIME_WAIT.
4) I have a mock version of the client. If I slow the client down to one request every quarter second, the server doesn't reject the connections.
5) If I slow down the server's responses, I don't have any connections rejected.
So my theory is that there is some shared resource (my top guess is the total number of socket handles) that VxWorks can have consumed at a given time. I'm also guessing that this number tops out at 255.
Does anyone know how I can get VxWorks to accept more connections and leave them in TIME_WAIT when closed? I have looked through the kernel configuration and changed all the values that looked remotely likely, but I have not been able to change the number.
We know that we could set SO_LINGER, but this is not an acceptable solution, even though it does prevent the client connections from getting rejected. We have also tried changing the timeout value for SO_LINGER, but this does not appear to be supported in VxWorks; it's either on or off.
Thanks!
Gail
To me it sounds like you are making a new connection for every LPQ query, and after the query is done you aren't closing the connection. In my opinion the correct thing to do is to accept one TCP connection and then use it for all of the LPQ queries; however, that may require mods to the client application. To avoid mods to the client, you should just close the TCP connection after each LPQ query, as in the sketch below.
Furthermore, you can set the max number of FDs open in VxWorks by adjusting the #define NUM_FILES in config.h (or configAll.h or one of those files), but that will just postpone the error if you have an FD leak, which you probably do.
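If modifying the client is off the table, the server-side pattern suggested above looks roughly like this sketch; handle_lpq_query() is a hypothetical stand-in for the existing LPD handling code:
#include <sys/socket.h>
#include <unistd.h>

void handle_lpq_query(int fd);      /* hypothetical: read one LPQ query, write the response */

/* Accept loop: serve exactly one LPQ query per connection and close it
   immediately so the file descriptor is released instead of accumulating. */
void serve_lpd(int listener) {
    for (;;) {
        int client = accept(listener, NULL, NULL);
        if (client < 0)
            continue;
        handle_lpq_query(client);
        close(client);              /* do not hold the descriptor between queries */
    }
}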

How to set the keep alive interval for winsock

I am using winsock and TCP.
I have set the KeepAlive option as follows
int aliveToggle = 1;
setsockopt(mySocket,SOL_SOCKET,SO_KEEPALIVE,(char*)&aliveToggle, sizeof(aliveToggle));
But how do I specify the keep-alive time and interval?
I am using VC++ running on windows 7.
From C/C++ you should be able to use SIO_KEEPALIVE_VALS to control the timeouts. You can't use setsockopt for this, but you should be able to use WSAIoctl. See https://web.archive.org/web/20130828175019/http://msdn.microsoft.com/en-us/library/windows/desktop/dd877220(v=vs.85).aspx
Here's an example: https://web.archive.org/web/20130827074722/http://read.pudn.com/downloads79/ebook/301417/Chapter09/SIO_KEEPALIVE_VALS/alive.c__.htm
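For instance, something along these lines should work (untested sketch; the 30-second idle time and 2-second probe interval are arbitrary illustrative values):
#include <winsock2.h>
#include <mstcpip.h>                /* struct tcp_keepalive and SIO_KEEPALIVE_VALS */
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")  /* MSVC: link against Winsock */

/* Enable keep-alives on an already-connected socket with custom timing. */
int enable_keepalive(SOCKET s)
{
    struct tcp_keepalive ka;
    DWORD bytesReturned = 0;
    ka.onoff = 1;                   /* turn keep-alives on for this socket     */
    ka.keepalivetime = 30000;       /* idle time before the first probe, in ms */
    ka.keepaliveinterval = 2000;    /* interval between unanswered probes, ms  */
    if (WSAIoctl(s, SIO_KEEPALIVE_VALS, &ka, sizeof(ka),
                 NULL, 0, &bytesReturned, NULL, NULL) == SOCKET_ERROR) {
        fprintf(stderr, "WSAIoctl(SIO_KEEPALIVE_VALS) failed: %d\n",
                WSAGetLastError());
        return -1;
    }
    return 0;
}

/* Usage with the socket from the question: enable_keepalive(mySocket); */
Unlike the registry settings described below, values passed through SIO_KEEPALIVE_VALS apply only to that one socket.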
Two registry settings under the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters control the behavior of TCP/IP keep-alives:
The KeepAliveTime value specifies how long the TCP connection sits idle, with no traffic, before TCP sends a keep-alive packet. The default is 7,200,000 milliseconds (ms) or 2 hours.
The KeepAliveInterval value indicates how many milliseconds to wait for a response after sending a keep-alive before repeating the keep-alive. If no response is received, the TCP/IP stack continues sending keep-alives at this interval until a response is received or until the stack reaches the packet retry limit specified in the TCPMaxDataRetransmissions registry key. KeepAliveInterval defaults to 1 second (1000 ms).
TCP keep-alives are disabled by default, but Windows Sockets applications can use the setsockopt function to enable them on a per-connection basis.
Note: If the developer elects to use TCP keep-alive messages on a particular connection, the timing of those messages is specified by the registry values described above. It is not possible to use different timing on different keep-alive requests.