REST API status when external APIs are down - Best Practices - rest

I'm looking for guidance on good practices when it comes to returning errors from a REST API. I'm working on a new API so I can take it any direction.
In my case the client invokes my API, which internally invokes some external APIs. Success is no problem, but for error responses from the far end (external cloud APIs) I am not sure what the industry standard is for such services. I am currently thinking of returning 200 OK with a JSON payload that details the external API errors.
So what are the industry recommendations? What are good practices (please explain why!), and, from a client point of view, what kind of error handling in the REST API makes life easier for the client code?

The failure you're asking about is one that has occurred within the internals of the service itself, even though it has external dependencies, so a status code in the 5XX range is the correct choice. 503 Service Unavailable looks perfect for the situation you've described.
5XX codes are used for telling the client that even though the request was fine, the server has had some kind of problem fulfilling the request. On the other hand, 4XX codes are used to tell the client that it has done something wrong in its request (and that the server is just fine, thanks).
Sections 10.4 and 10.5 of the HTTP 1.1 spec explain the different purposes of 4XX and 5XX codes.

Our colleagues have already provided links and explanations for the HTTP status codes, so you should study them and find the most appropriate one for your case.
I'll concentrate more on what can influence your decision, assuming you've learnt the status codes.
Basically, you should understand the business implications of the flow triggered by the client when they call "your" API. The client doesn't know anything about the external cloud API you're working with and doesn't really care whether it works or not; the client works with your application.
So when the remote system returns some kind of error (and yes, different error statuses should give you a clue about what's wrong with the remote system), it's your business decision how to handle this error, and depending on that decision you might want to "behave" differently in the interaction with the client.
Here are some examples:
You know that the remote system breaks extremely rarely, but once it's unavailable, your system doesn't work either.
In this case you might consider retrying the call to the remote system if it fails. If you're still out of luck, return an error status, probably something like 5XX.
You know that the data provided by the remote system is not really important; on the other hand, when the client calls your API it's better to provide "something", even if it's not really up to date, than nothing. Think about a remote system that provides "recommended movies" for a given client id, and you're building a portal (Netflix style). If this recommended-movies service is down for some reason, it doesn't make sense to fail the whole portal page (think about the awful user experience). In this case you might want to "pre-cache" some generic list of movies and use it as a fallback when that remote service fails. Here you should obviously return a 2XX status in any case.
More advanced architecture. You know that the remote service fails often, and you can continue to work when it's down. In this case you may want to choose an "asynchronous" style of interaction with the client. For example: the client calls your REST API and you respond immediately with an "Accepted" status code (202) and a ticket id. You save this id with its status in a database so that when the user asks for the status of the ticket by ticket id, you can query the DB. The point is that you return immediately. Then you send a message with the task to some messaging system, and once a consumer picks up the message, it is processed and the DB is updated. As long as the remote service fails, the message goes back to the queue, still "unprocessed" (messaging systems can usually implement this behavior). At some point the remote system starts responding, and all the messages get processed. Now their status in the DB is "done".
So it's up to the client to ask "what happened", or you can implement some push model with web sockets or similar (in which case it's no longer REST-style communication). But the point is that at some point the client will receive "OK, we're done with the ticket ID" (status 200). The client can then call a special endpoint and consume the stored results, which you also keep in the DB (again status 200).
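The asynchronous flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory `tickets` dict and `queue` stand in for a real database and messaging system, and the names `submit`, `process_one`, `get_status` and `call_remote` are hypothetical, not taken from any framework.

```python
import uuid
from collections import deque

# Hypothetical in-memory stand-ins for the database and the message queue.
tickets = {}      # ticket_id -> {"status": ..., "result": ...}
queue = deque()   # pending (ticket_id, payload) messages


def submit(payload):
    """Accept the request immediately and hand the work to the queue."""
    ticket_id = str(uuid.uuid4())
    tickets[ticket_id] = {"status": "pending", "result": None}
    queue.append((ticket_id, payload))
    return 202, {"ticket_id": ticket_id}


def process_one(call_remote):
    """Consumer step: try one message; put it back if the remote call fails."""
    if not queue:
        return
    ticket_id, payload = queue.popleft()
    try:
        result = call_remote(payload)
    except ConnectionError:
        queue.append((ticket_id, payload))  # still unprocessed, retried later
        return
    tickets[ticket_id] = {"status": "done", "result": result}


def get_status(ticket_id):
    """Polling endpoint: 200 with the ticket state, 404 for an unknown id."""
    ticket = tickets.get(ticket_id)
    if ticket is None:
        return 404, {"error": "unknown ticket"}
    return 200, ticket
```

The client only ever sees 202 on submission and 200 while polling; the remote outage stays hidden behind the "pending" status.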
Bottom line: as you can see, HTTP return codes are just an indicator. It's up to you how to organize the interaction with the client, and the relevant HTTP statuses will follow from your decisions.
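As a rough illustration of the first two examples in this answer (retry, then fail with a 5XX; or fall back to cached data and still return 2XX), here is a minimal sketch. The function names, the `GENERIC_MOVIES` list, the `stale` flag and the choice of 503 as the concrete 5XX code are all assumptions made for the example.

```python
import time

GENERIC_MOVIES = ["Movie A", "Movie B"]  # hypothetical pre-cached fallback list


def call_with_retry(call_remote, attempts=3, delay_seconds=0.0):
    """Critical dependency: retry a few times, then give up with a 5XX."""
    for attempt in range(attempts):
        try:
            return 200, call_remote()
        except ConnectionError:
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return 503, {"error": "upstream unavailable"}  # 503 is an assumed choice


def recommended_movies(call_remote):
    """Non-critical dependency: serve a generic list and still return 2XX."""
    try:
        return 200, {"movies": call_remote(), "stale": False}
    except ConnectionError:
        return 200, {"movies": GENERIC_MOVIES, "stale": True}
```

Which of the two functions applies is the business decision discussed above; the status code simply follows from it.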

I would use 503 - Service Unavailable - as the error. Reason -
This is considering the case that the API operation cannot be completed without response from the external API. This is similar to my DB not responding. So my API is unavailable for service till the external service is back online.
As an API client, I am not concerned with whether the API server internally invokes other APIs or not; I am only concerned with the result from the API server. So it does not matter to the client whether I am a proxy or not. Hence, I would avoid 502 (Bad Gateway) and 504 (Gateway Timeout). These errors can lead the client to wrongly assume that a gateway between the client and our service is causing the trouble.
As suggested by #developerjack, I would also recommend: "Include a Retry-After header so that your HTTP client knows not to spam you with retries until after X time. This means less error traffic for you, and better request planning for the client."
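On the client side, honoring Retry-After means parsing a header that may carry either a number of seconds or an HTTP date. A minimal sketch using only the Python standard library follows; the function name `retry_after_seconds` and its `now` parameter are illustrative choices, not part of any HTTP client library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Return seconds to wait, or None if the header is absent or unparseable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return int(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))
```

A client could sleep for the returned value before retrying a 503, and fall back to its own backoff policy when the result is None.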

HTTP calls are between client and server, so the error codes should reflect where the error or fault lies on either side of that relationship. Just because it's downstream of you doesn't mean the HTTP client needs to care about that.
Given this, you should be returning a 5xx error because the fault is not with the client, it's with the server (or its downstream services). Returning a 2xx (see below for caveat) would be incorrect because the HTTP call did not succeed, and a 4xx would be incorrect because it's not the client's fault.
Digging into specific 5xx's you can return:
A 504 or 502 might be appropriate if you specifically want to signal that your service is acting as a gateway/proxy.
A 523 is unofficial but used by Cloudflare to specifically signal that an upstream/origin service is unreachable.
A 500 (with a human and machine readable error body) is a safe default that simply indicates "there is something not right with the server and its services right now".
Now, in terms of best practice, there are some techniques you can use to either reduce the 500 errors, or make it easier on the clients to respond/react to this 5xx response.
Put in place retries within your service. If your service is working and the fault is downstream, and you can successfully store the client's request to retry later when downstream services are available, then you can still respond with a 2xx and the client knows that their request will be submitted. A great example of this might be a user sign up workflow; you can process the signup at your side, and then queue the welcome email to retry later if your email provider is unavailable.
Have human descriptions, machine error codes, and links in your API responses. Human descriptions are useful when debugging and developing against your service. Machine codes mean clients can index/track and code up specific code paths for a given scenario, and links to your docs mean you can always provide more information. Even better is including a specific ID so you can trace instances of this error in case the HTTP client needs to reach out for support (though this will depend heavily on your logging & telemetry). Here's an example:
{
    "error_code": 1234,
    "description": "X happened with Y because of Z.",
    "learn_more": "https://dev.my.app/errors/1234",
    "id": "90daa63b-f5ac-4c33-97d5-0801ade75a5d"
}
Include a Retry-After header so that your HTTP client knows not to spam you with retries until after X time. This means less error traffic for you, and better request planning for the client.
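Putting the last two suggestions together, a server-side helper might look roughly like this. It is a sketch only: the function name, the `(status, headers, body)` return shape and the default of 500 are assumptions, and the field names simply mirror the example payload above.

```python
import json
import uuid


def upstream_error_response(error_code, description, status=500,
                            retry_after_seconds=None):
    """Build (status, headers, body) for a failure caused by a downstream service."""
    body = {
        "error_code": error_code,
        "description": description,
        "learn_more": f"https://dev.my.app/errors/{error_code}",
        "id": str(uuid.uuid4()),  # lets support correlate with server logs
    }
    headers = {"Content-Type": "application/json"}
    if retry_after_seconds is not None:
        headers["Retry-After"] = str(retry_after_seconds)
    return status, headers, json.dumps(body)
```

For instance, `upstream_error_response(1234, "X happened with Y because of Z.", retry_after_seconds=30)` yields a 500 with a `Retry-After: 30` header and a body in the shape shown above.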

Related

What HTTP status code should server return if client request can't be fulfilled at the moment because of some business logic?

I have a chat application that works in browsers and uses REST API backend. It has following business logic rules:
a user can start chat session with any other users
a user can be in one and only one chat session at the time
if userA and userB have started a chat session and the session is currently active, then if userC tries to start chat session with either userA or userB server should prevent that and return some kind of error to userC
My question is what would be appropriate HTTP status code for this error that userC should receive?
This is not fault of the client so 4xx codes don't seem appropriate.
This is not server error so 5xx codes also don't seem appropriate.
I will send response body with additional JSON info message of why request failed but what would be appropriate HTTP status code that would satisfy RESTful principles?
409 sounds the most appropriate here. 409 is often used in cases where a request is otherwise valid, but cannot be fulfilled because of the current state of the server/some other resource. If that 'other state' changes, the request could be valid again.
I wrote a bit more about 409 Conflict here: https://evertpot.com/http/409-confict
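For the chat rules in the question, a 409-based handler could be sketched like this. The in-memory `active_sessions` dict, the `start_chat` name and the 201 success code are assumptions made for illustration.

```python
active_sessions = {}  # user -> current chat partner (hypothetical in-memory state)


def start_chat(initiator, target):
    """Return (status, body); 409 when either user is already in a session."""
    for user in (initiator, target):
        if user in active_sessions:
            return 409, {"error": f"{user} is already in a chat session"}
    active_sessions[initiator] = target
    active_sessions[target] = initiator
    return 201, {"participants": [initiator, target]}
```

The same request from userC would succeed once the conflicting session ends, which is the "state may change" reading of 409 described above.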
An important thing to recognize is that status codes, like headers, are metadata that belong to the domain of transferring documents over a network. The audience for a status code isn't just your bespoke client, but also all of the general purpose components participating in the message exchange.
My question is what would be appropriate HTTP status code for this error that userC should receive?
The first thing to work through is the appropriate class of status code to use. For unsafe requests, the primary question to ask is "did this request change the representation of the resource?" If it did, then you want to think about using a 2xx status code, because you will want the general purpose components to be using cache invalidation to evict the now out-of-date representations in their caches. If the resource didn't change state, then you want to be reviewing the 4xx status codes.
In either case, you can get a sense for the possibilities by reviewing the Status Code Registry, deciding which descriptions seem likely, then reviewing the authoritative reference to see if the semantics match what you are looking for.
More often than not, you can cheat and jump immediately to RFC 7231 -- the most familiar status codes are defined by the HTTP standard.
This is not fault of the client so 4xx codes don't seem appropriate.
It's probably the correct choice, though. 5xx, which says "I wanted to do what you asked, but I couldn't", is unlikely to be the right choice.
403 Forbidden is a pretty good option that says "I understood what you wanted, but I'm not going to do it." It's most commonly associated with a credentials problem, but the standard explicitly allows us to use this code elsewhere:
a request might be forbidden for reasons unrelated to the credentials.
409 Conflict is a reasonable candidate.
The good news is that, aside from the human semantics, there isn't actually a lot of difference in how general purpose components handle these two status codes. For instance, they have exactly the same default caching behaviors.
HTTP does have a standardized treatment of conditional requests; "apply this change to the resource only if the specified predicate is true". That, in effect, gives you a compare and swap operation - you tag your request with metadata indicating which version of a resource you are looking at locally.
Conditional requests have their own special error code to handle a request that is out of date: 412 Precondition Failed. There's also a 428 Precondition Required status code if a request is missing a predicate and you want to insist. The client would be expected to include an appropriate precondition header to proceed.
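To make the compare-and-swap idea concrete, here is a small sketch of the status decision for an update guarded by an If-Match header. The function name and the plain string comparison of entity tags are simplifying assumptions; a real server would follow the comparison rules in the HTTP specification.

```python
def update_if_match(current_etag, request_headers, require_precondition=True):
    """Decide the status code for an update guarded by an If-Match header."""
    if_match = request_headers.get("If-Match")
    if if_match is None:
        # No predicate supplied: insist on one, or allow a plain update.
        return 428 if require_precondition else 200
    candidates = [tag.strip() for tag in if_match.split(",")]
    if if_match.strip() == "*" or current_etag in candidates:
        return 200  # predicate holds, apply the change
    return 412      # the client's copy is out of date
```

A client that receives 412 would typically re-fetch the resource, pick up the new entity tag and decide whether to retry.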
As noted by Andrei Dragotoniu, status codes aren't intended to describe your domain behaviors. So you sometimes need to consider that 2xx is appropriate, because the server did what you asked, even though what you asked didn't have the outcome you hoped for.
Imagine, for example, a game on the web; you make a legal move, and the result of your move is that you lose the game. What status code should be used? Probably a 200 in that case - the server's state machine moved from playing the game to losing the game, and that's a perfectly legit outcome for a correctly handled HTTP request.
I don't guess that applies in your case; but you have more information to make that judgment.
First of all, what you use is up to you. There is no silver bullet; pick the error code that you feel suits the use case.
If I were writing this application, I would prefer to return 400 for this. Initiating a chat with a user who is already chatting with someone is the client's intention. If the application cannot satisfy this request, I would consider returning 400, and in the error message you can say that the user is already in a chat session with someone else.
5XX is a server-side error, which should not apply here because the server is still running fine, so I wouldn't use that.
Some people will even return 200 with an error field saying that the request actually failed. So all such choices come down to developer preference.
The status code that seems most appropriate is 409 Conflict; it shows exactly that the server won't accept the request because it is busy or doing something else.
For more info see https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

HTTP status code for email already confirmed

I have a server endpoint for confirming a user's email address. If the client tries to confirm again, what status code should I return?
The result should be the REST equivalent to "this was already done" or "you should not do that" or in programming terms "invalid state".
404: doesn't feel right as resource is there, but should not be used
409: doesn't feel right as it implies user can retry later, which he can't
422: doesn't feel right as input is semantically correct
501: doesn't feel right as the resource is implemented, but this is a user error
I'm using 404. But I'd like to know how others have dealt with this, and similar situations of "you could do that, but you shouldn't, and we won't allow it".
You'll want to be keeping in mind that the HTTP status codes are metadata from the domain of transferring documents over a network. A REST API is a facade that makes our app/service/domain model look like a document transfer component.
Also, you'll want to be thinking about the fact that the network is unreliable -- how is the client expected to recover if the HTTP response is lost? From the point of view of the client, a lost response is indistinguishable from a lost request.
Status codes are primarily metadata; the primary role that they fulfill is to communicate generic response semantics to generic clients (like browsers, or caches). When you are trying to communicate with the human being/machine intelligence running the protocol, you should be expecting to use the message-body.
2xx -- make sure there's nothing wrong with telling the client a second time that everything just worked. There are a lot of cases where this has the actual effect that you want.
410 Gone -- I think this is your best choice for "one time pad" scenarios. The server generates a unique link for some single use, and if it is used, or if some protocol timeout is exceeded, then the URI is burned never to be used again. The payload in this case would probably be a message to the client indicating that the entire protocol needs to start again, and providing links to the protocol start resources.
403 Forbidden "I'm sorry Dave, I can't do that". This is a perfectly normal way to say "No".
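One way these options could combine is sketched below: a repeated confirmation simply succeeds again (2xx), while an expired link gets 410. The token-store shape, the `confirm_email` name and the `/confirmations` restart link are hypothetical, and treating a repeat as success rather than burning the link on first use is one possible reading of the options above, not the only one.

```python
def confirm_email(tokens, token):
    """tokens maps token -> {"email": str, "confirmed": bool, "expired": bool}."""
    record = tokens.get(token)
    if record is None:
        return 404, {"error": "unknown confirmation link"}
    if record["expired"]:
        # The link is burned; point the client at the start of the protocol.
        return 410, {"error": "link expired", "restart": "/confirmations"}
    already = record["confirmed"]
    record["confirmed"] = True
    # Repeating the confirmation is harmless, so tell the client it worked.
    return 200, {"email": record["email"], "already_confirmed": already}
```

Returning 200 on the repeat also covers the lost-response case: a client that never saw the first answer can safely send the request again.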

API: what HTTP status code to use for multiple items found error?

Suppose there is a lookup API endpoint. A response can be successful (200), not found (404), ... and in my case more than one item found is an error. Which HTTP status code can describe more than one item found error the best?
The server fully understands the request, but can't deliver a representation that meets its part of the contract.
So the right error code is going to be in the 5xx range.
The 5xx (Server Error) class of status code indicates that the server is aware that it has erred or is incapable of performing the requested method.
If none of the specialized 5xx error codes fits, you should use 500
The 500 (Internal Server Error) status code indicates that the server encountered an unexpected condition that prevented it from fulfilling the request.
Michael Kropat did a good job of enumerating the options in Stop Making It Hard. He makes this interesting observation about 502:
I can tell you we would have saved hours upon hours of debugging time if only we had distinguished 502 Bad Gateway (an upstream problem) instead of confusing it with 500 Internal Server Error.
The modern definition of "gateway" can be found in RFC 7230 section 2.3 (Intermediaries).
A "gateway" (a.k.a. "reverse proxy") is an intermediary that acts as an origin server for the outbound connection but translates received requests and forwards them inbound to another server or servers. Gateways are often used to encapsulate legacy or untrusted information services, to improve server performance through "accelerator" caching, and to enable partitioning or load balancing of HTTP services across multiple machines.
All HTTP requirements applicable to an origin server also apply to the outbound communication of a gateway. A gateway communicates with inbound servers using any protocol that it desires, including private extensions to HTTP that are outside the scope of this specification.
Very very roughly, 500 is "my bad", where 502/504 points a finger somewhere else.
what error code would you use for my case?
Based on what you have described, 500. That's appropriate for "my representation of this resource is corrupt."
The reasonable alternative is 502, which is appropriate for "the upstream representation of this resource is corrupt".
In either case, the audience for the error is internal (the client can't do anything to correct the problem. Your support team probably can't do anything useful with the distinction between the status codes). You could just as reasonably argue that the fact that the problem is upstream is an implementation detail of no interest to clients (so 500 everything). Alternatively, you could argue that your API is a gateway that translates received requests and forwards them inbound to another server, and therefore the status code should be 502 because the problem is in your store, not in your api.
So it comes down to things like "when tracking the number of errors we have in the api, do we want to distinguish this sort of issue from an exception being thrown internally"?
Authoritative guidance seems to be lacking; choose a method, document your justification, and ship.
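The 500-versus-502 choice discussed in this answer can be reduced to a single flag, as in this sketch. The record shape, the `lookup` name and the `data_is_upstream` parameter are illustrative assumptions.

```python
def lookup(records, key, data_is_upstream=False):
    """Return (status, body) for a lookup that must match exactly one record."""
    matches = [record for record in records if record["key"] == key]
    if not matches:
        return 404, {"error": "not found"}
    if len(matches) > 1:
        # The uniqueness the API promises is violated: 500 if the data is
        # ours, 502 if we consider ourselves a gateway to an upstream store.
        status = 502 if data_is_upstream else 500
        return status, {"error": "lookup is ambiguous", "match_count": len(matches)}
    return 200, matches[0]
```

Whichever value you pick for the flag, documenting the choice keeps error dashboards interpretable later.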
That's an interesting problem. If it's an API call for a lookup and you get multiple results back, I'm actually expecting a 200 code with a results array.
But if the request itself is badly formatted, so that it's not clear what the client is asking for, you could send a 400 Bad Request HTTP status code.
Also have a look at the 207 Multi-Status code, as this might actually be closer to what you're looking for. Could you maybe provide request and response examples?

HTTP Status Code for External Dependency Error

What is the correct HTTP status code to return when a server is having issues communicating with an external API?
Say a client sends a valid request to my server A, A then queries server B's API in order to do something. However B's API is currently throwing 500's or is somehow unreachable, what status code should A return to the client? A 5* error doesn't seem correct because server A is functioning as it should, and a 4* error doesn't seem correct because the client is sending a valid request to A.
Did you consider status codes 502 and 504?
502 – The server, while acting as a gateway or a proxy, received an invalid response from the upstream server it accessed in attempting to fulfill the request.
504 – The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed to access in attempting to complete the request.
Of course, this would require a broad interpretation of "gateway" (implementation of interface A requiring a call to interface B), applied to the application layer. But this could be a nice way to say: "I cannot answer, but it's neither my fault nor yours".
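Under that broad "gateway" reading, server A could map failures of its call to server B onto status codes along these lines. This is a sketch: `UpstreamBadResponse` is a hypothetical exception that A's own client code would raise, and falling back to 500 for anything else is an assumed default.

```python
class UpstreamBadResponse(Exception):
    """Hypothetical error: server B answered, but with a 5xx or unusable body."""


def status_for_upstream_failure(exc):
    """Pick the status A returns to its client for a failed call to B."""
    if isinstance(exc, TimeoutError):
        return 504  # B did not answer in time
    if isinstance(exc, UpstreamBadResponse):
        return 502  # B answered, but the answer was invalid
    return 500      # anything else is treated as A's own failure
```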
Since the API relies on something that is not available, its service is unavailable as well.
I would think that the status code 503: Service Unavailable is the best fit for your situation. From the RFC description:
The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.
Granted, the description implies that this status code should be applied for errors on the server itself (and not to signal a problem with an external dependency). However, this is the best fit among the RFC status codes, and I wouldn't suggest using custom status codes, since standard ones are the codes anyone can understand.
Alternatively, if your API supports a way of communicating errors (e.g. to tell the user that the ID he supplied is incorrect) you may be able to use this method to tell the user that the dependency is unavailable. This might be a little friendlier and might avoid some bug searching on the user's side, since at least some of the users won't be familiar with any status codes besides 403, 404 and maybe 500, depending on your audience.
You can refer to this related question: HTTP Status 424 or 500 for error on external dependency
503 Service Unavailable looks perfect for the situation.
It's been a while since this question was asked.
I faced a similar situation today. After exploring a bit, it makes more sense to me to send 424 FAILED_DEPENDENCY.

What should a RESTful client do with its POST, PUT, or DELETE request upon a server error (500)

I have a RESTful service that throws a 500 INTERNAL SERVER ERROR status upon an internal failure for a number of reasons: DB errors upon connectivity or field size, code bugs, or issues with a managed code call. The resulting unhandled exception is reported back by IIS as a 500. Is this an appropriate use of 500? It could imply "retry request" according to MSDN Common REST API Error Codes. The proper API error code I am seeking is something like "### Server will NEVER process this request until a code change is made, do not resend or you will be looping forever and DOSing my server". Would a 400 Bad Request be more appropriate? It seems as if this is indicating a malformed request syntax itself, not that the service choked.
Furthermore, what should a client do when it encounters such an error? The server does not want another RESTful operation exactly like the previous one. The user may have spent some time doing data entry. Now we have to talk them off the ledge. Perhaps they can fix it on their own and that is the best practice? What are some similar experiences developers have had and how was it solved? Thanks.
4xx errors are "something is wrong with the client, they're sending the wrong stuff".
5xx errors are "something is wrong with the server, sorry it's out sick today."
Which basically means there's nothing the client can infer from a 5xx error. It could be permanent, it could be transient; the client doesn't know.
IIS sends a 500 error because IT doesn't know what happened. If your app is blindly throwing exceptions up to the web tier, there's not much more it can do or say about it.
If the server logic somehow actually KNOWS what's wrong, and WHEN it might be fixed, it can send a 503 error telling the client it's unavailable, along with a Retry-After header telling the client when it will be back.
As for client behavior, it's sort of dependent on the client's history with the service. Maybe the service intermittently fails with 500 errors, and another request will just work. This could happen, say, if you have a set of load balanced servers. The first server they hit is sick, but perhaps not sick enough that the load balancer has taken it out of rotation yet. So, another server may be just fine -- in that case the client could just retry and see what happens.
But in the end, it's up to the client as to what to do. It could try a simple back off algorithm. Retry once or twice. Retry once immediately, then again in 10s.
Or it could just push the 500 error back to the user with a polite message "tough luck".
Only the client's use cases and requirements can really dictate what its behavior should be when the server is dead.
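The "retry once immediately, then again in 10s" idea could look like this on the client. It is a sketch that assumes the request is safe to repeat (for a POST that is not idempotent, a blind retry may duplicate work); the `send` callable, which returns `(status, body)`, and the injectable `sleep` are illustrative.

```python
import time


def send_with_retries(send, delays=(0, 10), sleep=time.sleep):
    """Call send(); on a 5xx, wait delays[i] seconds and try again."""
    status, body = send()
    for delay in delays:
        if status < 500:
            break  # success or a client error: retrying will not help
        sleep(delay)
        status, body = send()
    return status, body
```

If the final attempt still returns a 5xx, the caller gets that status back and can show the user the polite "tough luck" message.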
At the client side, we have to assume that the web service is good, and that this is either a malformed request (i.e. the user has keyed in something inappropriate) or a network error of some kind. The method I used is an alert box asking the user to refresh the screen (F5) and try again with proper input. You may want to add "in case the error persists, contact ...".