Does HAProxy honor a 503?

I have an HAProxy acting as a load balancer for other boxes.
I know that when a box returns a response in the 500 range on a health check, HAProxy takes the box out of rotation.
What does HAProxy do if it gets a 503 from a health check? A 503 normally mandates a retry. Does it retry according to the Retry-After header, or does it take the box out of rotation?
If it retries, does the header matter? In other words, if there is no Retry-After header, does it still honor the 503 and retry, or does it count that as a box error and remove the box from rotation?

HAProxy treats any response in the 500 range as a failed health check. https://code.google.com/p/haproxy-docs/wiki/httpchk
Only 2xx and 3xx responses are considered successes; all others are considered failures. The Retry-After header is not consulted; the check simply fails.
The answer to the second part of your question depends on how your health check thresholds are set. If you have them configured to take the host out of rotation after one failure and the host returns a 503, then yes, it will be removed from rotation. If you require two consecutive failures and the host returns only one 503 before returning 200s again, the host will stay in rotation.
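As a sketch of how those thresholds interact (the backend name, addresses, and health endpoint below are placeholders, not from the question), a backend like this marks a box DOWN only after three consecutive failed checks, so a single 503 would not remove it:
backend app_servers
option httpchk GET /health
# fall 3: three consecutive failing checks (e.g. 503s) mark the server DOWN
# rise 2: two consecutive passing checks bring it back into rotation
server app1 10.0.0.1:8080 check inter 2s fall 3 rise 2
server app2 10.0.0.2:8080 check inter 2s fall 3 rise 2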

How to make HAProxy dispatch two consecutive requests to a node

In a two-node scenario using round-robin, I want HAProxy to dispatch two requests to each node before switching to the next node.
I have a messaging application which makes one request to get a messageID, then a second to send the message.
If I use a standard round-robin algorithm on two backend servers, one server ends up getting only the messageID requests while the other does all the message sending.
This is not really balanced, as providing messageIDs is a no-brainer for the server, while handling the messages, which can be up to a few hundred MB, is all done by the other node.
I had a look at weighted round-robin, but it does not seem to work out when using a weight of 2 for both servers, as the weights seem to be calculated relative to each other.
I'd be glad for any hint on how to get HAProxy to switch backend nodes after sending two requests, instead of one.
This is my current configuration, which still leads to a clear one-here, one-there round-robin pattern:
### frontend XTA Entry TLS/CA
frontend GMM_XTA_Entry_TLS_CA
mode tcp
bind 10.200.0.20:8444
default_backend GMM_XTA_Entrypoint_TLS_CA
### backend XTA Entry TLS/CA
backend GMM_XTA_Entrypoint_TLS_CA
mode tcp
server GMMAPPLB1-XTA-CA 10.200.0.21:8444 check port 8444 inter 1s rise 2 fall 3 weight 2
server GMMAPPLB2-XTA-CA 10.200.0.22:8444 check port 8444 inter 1s rise 2 fall 3 weight 2
Well, as stated, I would need a "two requests here, two requests there" round-robin pattern, but it keeps doing "one here, one there".
Glad for any hint, cheers,
Arend
To get the behavior you want, where requests go to a server two at a time, you can add an extra consecutive server line for each backend, like so:
backend GMM_XTA_Entrypoint_TLS_CA
balance roundrobin
mode tcp
server GMMAPPLB1-XTA-CA_1 10.200.0.21:8444 check port 8444 inter 1s rise 2 fall 3
server GMMAPPLB1-XTA-CA_2 10.200.0.21:8444 track GMMAPPLB1-XTA-CA_1
server GMMAPPLB2-XTA-CA_1 10.200.0.22:8444 check port 8444 inter 1s rise 2 fall 3
server GMMAPPLB2-XTA-CA_2 10.200.0.22:8444 track GMMAPPLB2-XTA-CA_1
However, if you can use HAProxy 1.9 or above, you can also use the balance random option, which should distribute requests evenly across your servers at random. I think this may solve the balancing problem you described more directly, and balance random will keep your requests balanced even if the mix of request types changes.
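For reference, a minimal sketch of that variant, reusing the server names from your configuration (only one entry per server is needed, since the distribution is random rather than positional):
backend GMM_XTA_Entrypoint_TLS_CA
balance random
mode tcp
server GMMAPPLB1-XTA-CA 10.200.0.21:8444 check port 8444 inter 1s rise 2 fall 3
server GMMAPPLB2-XTA-CA 10.200.0.22:8444 check port 8444 inter 1s rise 2 fall 3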
The proposed answer using four server entries in the backend did the job.
I am not sure it is the most elegant solution, but it helped me understand the usage of backends a bit more. Again, thanks for that.

What could "reason: Layer6 timeout" possibly mean?

I have HAProxy configured with two servers in the backend. Occasionally, every 16-20 hours, one of them gets marked by HAProxy as DOWN:
haproxy.log-20190731:2019-07-30T16:16:24+00:00 <local2.alert> haproxy[2716]: Server be_kibana_elastic/kibana8 is DOWN, reason: Layer6 timeout, check duration: 2000ms. 0 active and 0 backup servers left. 8 sessions active, 0 requeued, 0 remaining in queue.
I did some reading on how HAProxy runs its checks, but "Layer6 timeout" does not tell me much. What could be the possible reasons for that timeout? What does it actually mean?
Here is my backend configuration:
backend be_kibana_elastic
balance roundrobin
stick on src
stick-table type ip size 100k expire 12h
server kibana8 172.24.0.1:5601 check ssl verify none
server kibana9 172.24.0.2:5601 check ssl verify none
Layer 6 refers to TLS. The backend is accepting a TCP connection but isn't negotiating TLS (SSL) on the health check connection within the allowed time.
The configuration values timeout connect, timeout check, and inter all interact to determine how much time health checks are allowed to complete. The default value of inter, if not specified, is 2000 milliseconds, which is what you're seeing here. By default, inter (the health check interval) determines both how often checks run and how long they are allowed to take.
Since you have not configured a fall count for the servers, the default value of 3 applies, which means your server is actually failing 3 consecutive health checks before being marked down.
Consider adding option log-health-checks to the backend declaration, which will create additional log entries for those initial failing checks, before the final one causes the server to be marked down.
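A sketch of those additions to your backend (your server lines unchanged; the timeout check value is an illustrative assumption, not a recommendation):
backend be_kibana_elastic
balance roundrobin
option log-health-checks    # log each failing check, not only the final DOWN transition
timeout check 5s            # extra read time for checks, on top of the connect phase
stick on src
stick-table type ip size 100k expire 12h
server kibana8 172.24.0.1:5601 check ssl verify none
server kibana9 172.24.0.2:5601 check ssl verify none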
Increasing the allowable time may avoid the failure, but is probably valid only for testing -- not a fix -- because if your backend can't reliably respond to a check within 2000 ms, then it also can't reliably respond to client connections within that time frame, which is a long time to wait for a response.
Note that intermittent packet loss will typically cause sluggish behavior in increments of 3000 ms, because TCP stacks often use an initial retransmission timeout (RTO) of 3 seconds. Since this is more than 2000 ms, packet loss on your network is one possible explanation for the problem.
Another possible explanation is excessive CPU load on the backend, either related to traffic or to a cron job doing something intensive, because TLS negotiation -- relatively speaking -- is an expensive process from the CPU's perspective.

Envoy proxy returning 500

We are running production workloads with Istio 1.1.4 and noticed that, for a specific timeframe, the request latency reported to the telemetry component for client-invoked traffic increased from 50-60 ms to 6-7 seconds, and at the same time we started observing 500 (Internal Server Error) response codes from Envoy.
We are trying to understand under what circumstances Envoy returns a 500. The only thing I could find in the documentation/source code was that a 500 is returned if the response body must be buffered and exceeds the buffer limit. This is certainly not the case for us, as those 500s occurred for a health check endpoint, among other endpoints, whose response body is very small.
What are the cases where Envoy will return 500? What should we investigate as the root cause of the issue?
Can you please provide the status code reported in each of the following?
a) the log entry
b) telemetry
c) Prometheus and Grafana
Then check whether all three show the response code as 500, or whether there is any deviation.

What is the most appropriate HTTP status code for an already processed POST request?

I have a RESTful API that is used by another internal application that posts updates to it.
The problem is that some unexpected peaks occur, and during those times a request might take longer than 60 seconds (the limit defined by the load balancer, which I cannot change) to respond, which causes a 504 Gateway Timeout error.
When the latter application gets such a response, it retries the request after 10 minutes or so.
This caused some requests to be processed twice, because the first request was actually successful but took more than 60 seconds.
So I decided to use Idempotency Keys in the requests to avoid this problem. The issue is that I don't know what I should return in this case.
Should I just stick with 200 OK? Should I return some 4xx code?
It highly depends on whether you consider it an error or not; the exact response code is more a matter of taste than of best practice. But since I guess you're rejecting the duplicated requests, you may want to report an error code such as 409 Conflict:
Indicates that the request could not be processed because of conflict
in the current state of the resource, such as an edit conflict between
multiple simultaneous updates.
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_Client_errors
Whenever a resource conflict would be caused by fulfilling the request. Duplicate entries and deleting root objects when cascade-delete is not supported are a couple of examples.
https://www.restapitutorial.com/httpstatuscodes.html
A potentially useful reference is RFC 5789, which describes the PATCH method. Obviously, you aren't doing a patch, but the error handling is analogous.
For instance, if you were sending a JSON Patch document, you might ensure idempotent behavior by including a test operation that checks that the resource is in the expected initial state. After your operation, that check would presumably fail. In that case, the error handling section of RFC 5789 (section 2.2) outlines a number of different possible responses.
Another source of inspiration is RFC 7232, which describes conditional requests. The section on If-Match includes this gem:
An origin server MUST NOT perform the requested method if a received If-Match condition evaluates to false; instead, the origin server MUST respond with either a) the 412 (Precondition Failed) status code or b) one of the 2xx (Successful) status codes if the origin server has verified that a state change is being requested and the final state is already reflected in the current state of the target resource (i.e., the change requested by the user agent has already succeeded, but the user agent might not be aware of it, perhaps because the prior response was lost or a compatible change was made by some other user agent).
From this, I infer that 200 is completely acceptable if you can determine that the work was already done successfully.
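As a minimal sketch of that 200-for-already-done approach (using Python and Flask; the endpoint, header name, and in-memory store are illustrative assumptions, not from the question):
from flask import Flask, request, jsonify

app = Flask(__name__)
processed = {}  # idempotency key -> previously computed result

@app.route("/updates", methods=["POST"])
def post_update():
    key = request.headers.get("Idempotency-Key")
    if key and key in processed:
        # Work already done successfully: return 200 with the original
        # result, in line with the RFC 7232 reasoning quoted above.
        return jsonify(processed[key]), 200
    result = {"status": "processed", "update": request.get_json()}
    if key:
        processed[key] = result
    return jsonify(result), 201  # first time through: report success normally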

200 vs 403 server response - which degrades server's performance more?

Some rogue people have set up server monitoring that connects to the server every 2 minutes to check if it's down (they connect from several different accounts, so they ping the server every 20 seconds or so). It's a simple GET request.
I have two options:
1. Leave it as it is (i.e. allow them via a normal 200 server response).
2. Block them by either IP or user agent (giving a 403 response).
My question is: which is the better solution as far as server performance is concerned (i.e. which is less 'stressful' on the server), option 1 (the 200 response) or option 2 (the 403 response)?
I'm inclined towards option 1, since there would be no IP / user-agent checking, which should mean less stress on the server. Correct?
It doesn't matter.
The cost of the status code and an if-check on the user-agent string is completely dominated by network I/O, garbage collection, and the server's other subsystems.
If they just query every 2 minutes, I'd very much leave it alone. If they query a few hundred times per second, it's time to act.
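If you do decide to block them, and assuming the box sits behind HAProxy as in the other threads on this page (an assumption; the question does not say what the server runs), a deny rule might look like this sketch, where the frontend name, the monitor's address range, and its User-Agent are all hypothetical:
frontend web_entry
mode http
bind *:80
# either match: the monitor's source range or its User-Agent substring
acl is_monitor src 203.0.113.0/24
acl is_monitor hdr_sub(User-Agent) -i uptime-checker
http-request deny deny_status 403 if is_monitor
default_backend app_servers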