We are running production workloads with Istio 1.1.4 and noticed that, for a specific timeframe, the request latency reported to the telemetry component for client-invoked traffic increased from 50-60 ms to 6-7 seconds, and at the same time we started observing 500 (Internal Server Error) response codes from Envoy.
We are trying to understand under what circumstances Envoy returns a 500. The only thing I could find in the documentation/source code was that a 500 is returned if the response body must be buffered and it exceeds the buffer limit. This is certainly not the case for us, as those 500s occurred on, among other endpoints, a health-check endpoint whose response body is very small.
What are the cases where Envoy will return 500? What should we investigate as the root cause of the issue?
Can you please provide the status code recorded in each of the following?
a) the access log entry
b) telemetry
c) Prometheus and Grafana
and check whether all three show the response code as 500, or whether there is any deviation between them?
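One way to cross-check (b) against (c) is to break the errors down by Envoy's response flags, which distinguish errors generated by Envoy itself from 500s passed through from the application. A hedged sketch of a Prometheus query, assuming the standard Istio metric names and labels are in place:

# 5xx rates as seen by Istio telemetry, grouped by Envoy response flags.
# An empty flags value ("-") usually means the error came from the
# application itself; flags such as UT (upstream request timeout) or
# UC (upstream connection termination) point at Envoy or the mesh.
sum(rate(istio_requests_total{response_code=~"5.."}[5m]))
  by (response_code, response_flags, destination_service)

If the flags implicate Envoy, the jump from 50-60 ms to 6-7 s latency makes timeout-related flags the first thing to look for.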
We have a private REST API that is locked down and only ever called by software we control, not the public. Many endpoints take a JSON payload. If deserialising the JSON payload fails (e.g. the payload has an int where a Guid is expected), an exception is thrown and the API returns a 500 Internal Server Error. Technically, it should return a 400 Bad Request in this circumstance.
Without knowing how much effort is required to ensure a 400 is returned in this circumstance, is there benefit in changing the API to return a 400? The calling software and QA are the only entities that see this error, and it only occurs if the software is sending data that doesn't match the expected model which is a critical defect anyway. I see this as extra effort and maintenance for no gain.
Am I missing something here that the distinction between 400 and 500 would significantly help with?
From a REST perspective:
If you want to follow strict REST principles, you should return a 4xx, as the problem is with the data being sent and not with the server program.
5xx is reserved for server errors, for example when the server was not able to execute the method due to a site outage or a software defect. 5xx-range status codes SHOULD NOT be used for validation or logical error handling.
From a technical perspective:
The reported error does not convey useful information if another programmer or team works on the issue tomorrow.
If tomorrow you have to log your errors in a central error log, you will pollute it with wrong status codes.
As a consequence, if QA decides to run reports/metrics on errors, those will be erroneous.
You may be increasing your technical debt, which can impact your productivity in the future.
The least you can do is to log this issue or create a ticket if you use a tool like JIRA.
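For what it's worth, mapping deserialisation failures to a 400 is usually a small, localised change. A minimal sketch in Python/Flask (purely illustrative; your stack sounds like .NET, and the endpoint and field names here are made up):

from flask import Flask, jsonify, request
from uuid import UUID

app = Flask(__name__)

@app.route("/items", methods=["POST"])
def create_item():
    payload = request.get_json(silent=True)
    if payload is None:
        # Malformed JSON is the client's fault: answer 400, not 500.
        return jsonify(error="request body is not valid JSON"), 400
    try:
        # An int where a Guid is expected fails here with ValueError.
        item_id = UUID(str(payload["id"]))
    except (KeyError, TypeError, ValueError):
        return jsonify(error="field 'id' must be a GUID"), 400
    # ... normal processing; a genuine server fault below still yields 500 ...
    return jsonify(id=str(item_id)), 201

In most frameworks the equivalent is a single exception filter or error handler around model binding, so the effort tends to be one-off rather than per-endpoint.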
Should it matter if a call to a private REST API returns 400 or 500?
A little bit.
The status code is metadata:
The status-code element is a 3-digit integer code describing the result of the server's attempt to understand and satisfy the client's corresponding request. The rest of the response message is to be interpreted in light of the semantics defined for that status code.
Because we have a shared understanding of the status codes, general purpose clients can use that meta data to understand the broad meaning of the response, and take sensible actions.
The primary difference between 4xx and 5xx is the general direction of the problem. 4xx indicates a problem in the request, and by implication with the client:
The 4xx (Client Error) class of status code indicates that the client seems to have erred.
5xx indicates a problem at the server:
The 5xx (Server Error) class of status code indicates that the server is aware that it has erred or is incapable of performing the requested method.
So imagine, if you would, a general-purpose reverse proxy acting as a load balancer. How might the proxy take advantage of the ability to discriminate between 4xx and 5xx?
Well... 5xx suggests that the query itself might be fine. So the proxy could try routing the request to another healthy instance in the cluster, to see if a better response is available. It could look at the pattern of 5xx responses from a specific member of the cluster, and judge whether that instance is healthy or unhealthy. It could then evict that unhealthy instance and provision a replacement.
On the other hand, with a 4xx status code, none of those mitigations make any sense - we know instead that the problem is with the client, and that forwarding the request to another instance isn't going to make things any better.
Even if you aren't going to automatically mitigate the server errors, it can still be useful to discriminate between the two error codes, for internal alarms and reporting.
(In the system I maintain, we're using general purpose monitoring that distinguishes 4xx and 5xx responses, with different thresholds to determine if I should be paged. As you might imagine, I'm rather invested in having that system be well tuned.)
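To make the proxy example concrete, here is a hedged HAProxy sketch (the backend and server names and addresses are made up; retry-on by status code needs HAProxy 2.0 or later):

backend app
    balance roundrobin
    retries 3
    option redispatch      # allow re-dispatching a failed request to another server
    retry-on 503           # a 5xx might succeed elsewhere, so try again
    # 4xx responses are never retried: the request itself is at fault,
    # and replaying it against another instance cannot help.
    server app1 10.0.0.11:8080 check observe layer7 error-limit 10 on-error mark-down
    server app2 10.0.0.12:8080 check observe layer7 error-limit 10 on-error mark-down

The observe layer7 / on-error mark-down pair is exactly the "judge health from the pattern of 5xx responses and evict" behaviour described above; none of it would work if the application reported client errors as 5xx.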
Some rogue people have set up server monitoring that connects to the server every 2 minutes to check if it's down (they connect from several different accounts, so between them they ping the server every 20 seconds or so). It's a simple GET request.
I have two options:
1. Leave it as it is (i.e. allow them via a normal 200 server response).
2. Block them by either IP or user-agent (giving a 403 response).
My question is: which is the better solution as far as server performance is concerned (i.e. which is less 'stressful' on the server), 1 (200 response) or 2 (403 response)?
I'm inclined to #1, since there would be no IP/user-agent checking, which should mean less stress on the server, correct?
It doesn't matter.
The cost of setting the status code and doing an if-check on the user-agent string is completely dominated by network I/O, GC, and the other server subsystems.
If they just query every 2 minutes, I'd very much leave it alone. If they query a few hundred times per second, it's time to act.
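For scale, here is roughly what option 2 costs in, say, a Python/Flask app (the blocked user-agent string is made up): one substring comparison against one short header, which is nothing next to the network round-trip the request incurs either way.

from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_AGENT = "RogueMonitor"   # hypothetical user-agent of the monitors

@app.before_request
def block_monitors():
    # A single substring check: microseconds of CPU per request,
    # dwarfed by network I/O regardless of whether we answer 200 or 403.
    if BLOCKED_AGENT in request.headers.get("User-Agent", ""):
        abort(403)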
I have defined an API that serves data from MongoDB. The problem is that if I hit the API continuously from the same IP address, the results are not consistent: if it returns a proper result the first time, the next time it fails to connect. If I hit a plain "hello world" API, it never fails, no matter how frequently I hit it from the same IP. I am listening on HTTP port 80. Can anyone please advise what the problem is and how to solve it? I'm new to these server concepts.
In my humble opinion, Perfect already has high availability. Even on the most affordable VM, the API response should still be fast enough. This is my load-testing result:
$ wrk -t12 -c400 -d30s http://localhost:19808/
Running 30s test @ http://localhost:19808/
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 30.98ms 10.10ms 86.14ms 73.83%
Req/Sec 546.08 267.66 1.75k 58.56%
194376 requests in 30.07s, 27.07MB read
Socket errors: connect 157, read 717, write 0, timeout 0
Requests/sec: 6464.58
Transfer/sec: 0.90MB
So even in the most extreme situation, fewer than 0.5% of requests (874 socket errors out of 194,376) saw a bad connection. Please share your source code with us; otherwise no one will have a clue what is going on in your AWS instance.
I have an OData request in my SAPUI5 application which calls the Gateway.
On the Gateway, I have a trusted RFC connection to the backend.
Now I have a complex algorithm with a duration of around 2 minutes.
After 60 seconds, I get a timeout error:
HTTP request failed 500, Internal Server Error, 500 Connection timed out
Is there a way to increase the timeout?
I tried the parameters gw/reg_timeout and gw/conn_pending, and the keep-alive timeout of the RFC connection.
None of these options solved my problem.
I guess you already tried everything from SAP Help.
Maybe this is some ICM/Web Dispatcher timeout; check the link and try some of the settings, e.g. PROCTIMEOUT. Also consider the hints there:
Recommendation
In systems where the standard timeout setting of 60 seconds for the keep-alive and processing timeouts is not sufficient due to long-running applications, SAP recommends that both the TIMEOUT and PROCTIMEOUT parameters are set for the services concerned so that they can be configured independently of each other. The TIMEOUT value should not be set unnecessarily high. We recommend you set this parameter as follows:
icm/server_port_0 = PROT=HTTP,PORT=1080,TIMEOUT=60,PROCTIMEOUT=600
in order to allow a maximum processing time of 10 minutes.
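Applied to your case (an algorithm that runs about 2 minutes), a hedged example, assuming your HTTP port is the one configured via icm/server_port_0: keep the keep-alive timeout at 60 seconds but allow 3 minutes of processing time:

icm/server_port_0 = PROT=HTTP,PORT=1080,TIMEOUT=60,PROCTIMEOUT=180

The parameter is set per service/port, so adjust the PROT and PORT values to match your actual configuration.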
I have an HAProxy acting as a load balancer for other boxes.
I know that when a box returns a response in the 500 range on a health check, HAProxy takes the box out of rotation.
What does it do if the proxy gets a 503 from a health check? A 503 normally mandates a retry. Does it retry according to the Retry-After header, or does it take the box out of rotation?
If it retries, does the header matter? In other words, if there is no Retry-After header, does it still honor the 503 and retry, or does it count that as a box error and remove the box from rotation?
HAProxy treats any response in the 500 range as an error. https://code.google.com/p/haproxy-docs/wiki/httpchk
Only 200s and 300s are considered successes. All others are considered failures.
The answer to the second part of your question depends on how your health check thresholds are set. If they are set to take the host out of rotation after 1 failure and the host returns a 503, then yes, it will be removed from rotation. If they are configured to require 2 failures and the host returns only a single 503 and then starts returning 200s, the host will stay in rotation.
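As a hedged sketch of how those knobs look in the config (the health-check path and addresses are made up): fall sets how many consecutive failed checks mark a server down, and rise how many successes bring it back. A 503 simply counts as one failed check, and Retry-After plays no part in the decision.

backend app
    option httpchk GET /healthz
    # fall 2: mark the server down after 2 consecutive failed checks
    # rise 3: bring it back up after 3 consecutive passing checks
    server app1 10.0.0.11:8080 check inter 2s fall 2 rise 3
    server app2 10.0.0.12:8080 check inter 2s fall 2 rise 3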