Good practices to propagate errors through micro services - rest

We have a micro services architecture and we are having some discussions about how to expose internal errors to the client.
Here's an example:
Let's suppose we have 3 services, service A, B and C.
When the client sends a request to service A, which is public, this service sends a request to service B, which in turn sends a request to service C. B and C are internal and need authentication, but the credentials are stored internally (for example as environment variables); they are not sent by the client.
For some reason the communication between B and C receives a 401 (it could be 422, 403 or any other client error), which means that the request was not authorized.
Something like that:
The communication between B and C is internal and the user doesn't know about these services. Should I expose our internal structure by sending a 401 to the client, given that it's not the client's fault? Should I send a 500?

It's better to avoid exposing a 500 status explicitly, but in some cases it's necessary. A user works with your system, not with a particular service, and for them it doesn't matter what is inside. The internal implementation can vary while the user interaction stays the same.
Let's say A is, for instance, an e-commerce service, B a billing service and C a billing gateway. The user buys a product via A, which sends a billing request to B, and B communicates with C to perform the transaction. A 401 between B and C can happen for different reasons. If it is simply an internal configuration problem (a password that was not updated, an expired certificate and so on), it is an internal system bug and you need to tell the user that the service is currently unavailable or something similar, without passing on the internal error details. You can use a 5xx code in this case. Perhaps service B can put the request on some kind of queue and tell service A that everything is OK and the request will be processed later. But if it happens because the user tries to use a bad credit card or doesn't have enough money (a request that is not authorized), A needs to show a proper message and return a 4xx response code.
In general a service exposes resources, and it doesn't matter how many internal or external services, databases, data sources and so on are behind it. Perhaps a 401 between B and C means that B should go to service D (an alternative to C), and service A shouldn't know about the 401 at all. So it depends on what you need to expose to the user and how you need to handle the different cases.
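A minimal sketch of that translation at the boundary of service A, in Python. The function name, the reason strings and the choice of 402 for a declined card are assumptions made for illustration; the answer above only says "4xx" and "5xx".

```python
from typing import Optional, Tuple

# Hypothetical reason codes that service B might attach to a failure.
USER_CAUSED = {"card_declined", "insufficient_funds"}


def public_response(downstream_status: int, reason: Optional[str] = None) -> Tuple[int, str]:
    """Translate the outcome of the call to B into what A shows the client."""
    if 200 <= downstream_status < 300:
        return 200, "OK"
    if reason in USER_CAUSED:
        # The user can fix this, so a 4xx with a clear message fits.
        return 402, "Payment was declined, please check your card."
    # Anything else (expired internal credentials, bad configuration,
    # outages) is our problem: hide the details behind a generic 5xx.
    return 503, "Service temporarily unavailable, please try again later."
```

With this shape, an internal 401 between B and C never reaches the client as a 401: `public_response(401)` yields a 503, while `public_response(402, "card_declined")` yields a 4xx.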

Your diagram makes little sense. The incoming call is not 200 until it returns to the user successfully, after all internal services are called.
If the authentication between B and C is internal (server to server auth), then you have an internal error, and 502 is a sane choice to return to A. Of course, you might decide to retry in server A, as you got a 502 from B, but it's pointless because it's an expired token. So you may decide as policy that internal 401s should be escalated back to A. Or you may find attaching metadata in the 502 error response body assists a retrying mechanism. Anyway, server-server auth shouldn't be failing where it is a valid call.
So ... if C's authentication works on the user's supplied token, then the user's authentication ran out during the call (rare, but it happens). In this case the token should have been extended elsewhere in the system prior to this call (probably in A's call to SSO). But it wasn't, so return a 401 to wherever in the application redirects to the login page.
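As a rough illustration of the "metadata in the 502 body" idea, here is a small Python sketch. The field names (`upstream_service`, `upstream_status`, `retryable`) and the rule that only upstream 5xx responses are worth retrying are assumptions, not an established convention.

```python
import json


def bad_gateway_body(upstream_service: str, upstream_status: int) -> str:
    """Body that B could attach to a 502 so that A can decide what to do."""
    # An expired internal token will not fix itself on retry; a 5xx might.
    retryable = upstream_status >= 500
    return json.dumps({
        "upstream_service": upstream_service,
        "upstream_status": upstream_status,
        "retryable": retryable,
    })


def should_retry(status: int, body: str) -> bool:
    """Decision taken in A when it receives a response from B."""
    if status != 502:
        return False
    return bool(json.loads(body).get("retryable", False))
```

A 502 caused by a 401 between B and C would carry `"retryable": false`, so A can skip the pointless retry and escalate instead.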

Related

Should users await for a response after the http request in a saga pattern architecture?

I am designing a microservice architecture, using a database per service pattern.
Following the example of Order Service and Shipping Service, when a user makes an HTTP REST request to the Order Service, this one fires an event to notify shipping service. All this happens asynchronously. So, what happens with the user experience? I mean, the user needs an immediate response from the HTTP request. How can I handle this scenario?
Respond as soon as you have stored the request.
Part of the point of microservices is that you have a system composed of independently deployable elements that do not require coordination.
If you want a system that is reliable even though the services don't have 100% uptime, then you need to have some form of durable message storage so that the sender and the receiver don't need to be running at the same time.
Therefore, your basic pattern for data from the outside is that the information from the incoming HTTP request is copied, not directly into a running service, but instead into the message store, to be processed by the service at some later time.
In other words, your REST API is a facade in front of your storage, not in front of the service itself.
The actor model may be a useful analogy; information moves around by copying messages into different inboxes, where they are later consumed by the subscribing actor.
From the perspective of the client, the HTTP response is an acknowledgement that the request has been received and recognized as valid. Think "thank you for your order, we'll send you an email when your purchase is ready for pick up."
On the web, we would include in the response links to other useful resources; click here to see the status of your order, click there to see your history of recent orders, and so on.
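A minimal Python sketch of "respond as soon as you have stored the request". The in-memory `message_store`, the `OrderPlaced` event name and the URL paths are all hypothetical stand-ins; in a real system the store would be a durable broker or an outbox table.

```python
import uuid
from collections import deque

# Stand-in for a durable message store (a broker or outbox table in practice).
message_store = deque()


def place_order(payload: dict) -> tuple:
    """Validate, store the message, and acknowledge without waiting for shipping."""
    if not payload.get("items"):
        return 400, {"error": "an order needs at least one item"}
    order_id = str(uuid.uuid4())
    message_store.append({"type": "OrderPlaced", "order_id": order_id, "payload": payload})
    # 202 Accepted: the request was received and is valid; processing comes later.
    return 202, {
        "order_id": order_id,
        "links": {
            "status": f"/orders/{order_id}/status",
            "history": "/orders",
        },
    }
```

The shipping service would consume from the store on its own schedule; the client only needs the acknowledgement and the links to follow up.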

REST API status when external APIs are down - Best Practices

I'm looking for guidance on good practices when it comes to returning errors from a REST API. I'm working on a new API so I can take it any direction.
In my case the client invokes my API, which internally invokes some external APIs. In case of success there is no problem, but in case of error responses from the far end (external cloud APIs) I am not sure what the industry standard is for such services. I am currently thinking of returning 200 OK with a JSON payload that details the external API errors.
So what are the industry recommendations? Good practices (please explain why!) and also, from a client point of view, what kind of error handling in the REST API makes life easier for the client code?
The failure you're asking about is one that has occurred within the internals of the service itself, even though it has external dependencies, so the 5XX status code range is the correct choice. 503 Service Unavailable looks perfect for the situation you've described.
5XX codes are used for telling the client that even though the request was fine, the server has had some kind of problem fulfilling the request.
4XX codes, on the other hand, are used to tell the client that it has done something wrong in the request (and that the server is just fine, thanks).
Sections 10.4 and 10.5 of the HTTP 1.1 spec explain the different purposes of 4XX and 5XX codes.
Our colleagues have already provided the links / explanations about the HTTP status codes so you should learn them and find the most appropriate in your case.
I'll concentrate more on what can influence your decisions, assuming you've learnt the status codes.
Basically, you should understand the business implications of the flow triggered by the client when they call "your" API. The client doesn't know anything about the external cloud API you're working with and doesn't really care whether it works or not; the client works with your application.
So when the remote system returns some kind of error (and yes, different error statuses should give you a clue about what's wrong with the remote system), it's your business decision how to handle this error, and depending on this decision you might want to "behave" differently in the interaction with the client.
Here are some examples:
You know that the remote system breaks extremely rarely. But once it's unavailable, your system doesn't work either.
In this case you might consider retrying the call to the remote system if it failed. And if you're still out of luck, return some error status, probably something like 5XX.
You know that the data provided by the remote system is not really important; on the other hand, when the client calls your API it's better to provide "something", even if it's not really up to date, than nothing. Think about a remote system that provides "recommended movies" for some client id, and you're building a portal (Netflix style). If this recommended movies service is down for some reason, it doesn't make sense to fail the whole portal page (think about the awful user experience). In this case you might want to "pre-cache" some generic list of movies and use it as a fallback in case of failure of that remote service. Here you should obviously return a 2XX status in any case.
More advanced architecture. You know that the remote service fails often, and you can continue to work when it's down. In this case you may want to choose an "asynchronous" style of interaction with the client. For example: the client calls your REST API and you respond immediately with an "Accepted" status code (202) and a ticket id. You can save this id with a status in some database, so that when the user asks for the status of the ticket by ticket id you'll be able to query the DB. The point is that you return immediately. Then you might want to send a message with the task to some messaging system, and once a consumer picks up the message, it will be processed and the DB will be updated. As long as the remote service fails, the message will go back to the queue, still "unprocessed" (messaging systems can usually implement this behavior). At some point in time the remote system starts responding and all the messages get processed. Now their status in the DB is "done".
So it's up to the client to ask "what happened", or you can implement some push model with web sockets or something similar (it's not REST-style communication anymore in this case). But the point is that at some point in time the client will receive "OK, we're done with the ticket ID" (status 200). The client can then call a special endpoint and consume the stored results, which you'll keep in the DB as well (again status 200).
Bottom line, as you see, HTTP return codes are just an indicator. It's up to you how to organize the interaction with the client, and the relevant HTTP statuses will be derived from your decisions.
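The first two cases above (retry, then fall back to a pre-cached list) can be sketched roughly like this in Python. `DEFAULT_MOVIES`, the function names and the use of `ConnectionError` as the failure signal are assumptions chosen to keep the example self-contained.

```python
import time
from typing import Callable, List

# Generic list prepared ahead of time (hypothetical content).
DEFAULT_MOVIES = ["Popular movie 1", "Popular movie 2"]


def call_with_retry(fetch: Callable[[], List[str]], attempts: int = 3, delay: float = 0.0) -> List[str]:
    """Call the remote system, retrying a few times before giving up."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of luck: let the caller decide what to return
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


def recommended_movies(fetch: Callable[[], List[str]]) -> tuple:
    """Return a 2xx in every case, falling back to the generic list."""
    try:
        return 200, call_with_retry(fetch)
    except ConnectionError:
        return 200, DEFAULT_MOVIES
```

For the first case, where the remote data is essential, the caller would catch the `ConnectionError` from `call_with_retry` and return a 5XX instead of a fallback.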
I would use 503 - Service Unavailable - as the error. Reason -
This is considering the case that the API operation cannot be completed without response from the external API. This is similar to my DB not responding. So my API is unavailable for service till the external service is back online.
As an API client, I am not concerned whether the API server internally invokes other APIs or not. I am just concerned with the result from the API server. So it does not matter to the client whether I am a proxy or not; hence I would avoid 502 (Bad Gateway) and 504 (Gateway Timeout). These errors can lead the client to the wrong assumption that a gateway between the client and our service is causing trouble.
As suggested by #developerjack, I would also recommend to - "Include a Retry-After header so that your HTTP client knows not to spam you with retries until after X time. This means less error traffic for you, and better request planning for the client."
HTTP calls are between client and server, and so the error codes should reflect where the error or fault lies on either side of that relationship. Just because the fault is downstream of you doesn't mean the HTTP client needs to care about that.
Given this, you should be returning a 5xx error because the fault is not with the client, it's with the server (or its downstream services). Returning a 2xx (see below for a caveat) would be incorrect because the HTTP call did not succeed, and a 4xx would be incorrect because it's not the client's fault.
Digging into specific 5xx's you can return:
A 504 or 502 might be appropriate if you specifically want to signal that your service is acting as a gateway/proxy.
A 523 is unofficial but used by Cloudflare to specifically signal that an upstream/origin service is unreachable.
A 500 (with a human and machine readable error body) is a safe default that simply indicates "there is something not right with the server and its services right now".
Now, in terms of best practice, there are some techniques you can use to either reduce the 500 errors, or make it easier on the clients to respond/react to this 5xx response.
Put in place retries within your service. If your service is working and the fault is downstream, and you can successfully store the client's request to retry later when downstream services are available, then you can still respond with a 2xx and the client knows that their request will be submitted. A great example of this might be a user sign-up workflow; you can process the sign-up on your side, and then queue the welcome email to retry later if your email provider is unavailable.
Have human descriptions, machine error codes and links in your API responses. Human descriptions are useful when debugging and developing against your service. Machine codes mean clients can index/track errors and code up specific code paths for a given scenario, and links to your docs mean you can always provide more information. Even better is including a specific ID that lets you trace instances of this error in case the HTTP client needs to reach out for support (though this will be heavily dependent on your logging & telemetry). Here's an example:
{
"error_code": 1234,
"description": "X happened with Y because of Z.",
"learn_more": "https://dev.my.app/errors/1234",
"id": "90daa63b-f5ac-4c33-97d5-0801ade75a5d"
}
Include a Retry-After header so that your HTTP client knows not to spam you with retries until after X time. This means less error traffic for you, and better request planning for the client.
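Putting the error body and the Retry-After header together, a minimal Python sketch could look like this. The function name and the choice of 503 are assumptions; the body fields and the docs URL pattern are taken from the JSON example above.

```python
import json
import uuid


def unavailable_response(error_code: int, description: str, retry_after_seconds: int) -> tuple:
    """Build (status, headers, body) for a downstream outage."""
    body = {
        "error_code": error_code,
        "description": description,
        "learn_more": f"https://dev.my.app/errors/{error_code}",
        "id": str(uuid.uuid4()),  # lets support correlate this with server logs
    }
    headers = {
        "Content-Type": "application/json",
        # Retry-After accepts either a number of seconds or an HTTP date.
        "Retry-After": str(retry_after_seconds),
    }
    return 503, headers, json.dumps(body)
```

A client that sees the header can wait the suggested number of seconds before retrying instead of hammering the API.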

Http Status Code When Downstream Validation Fails

I have an API that charges for an order. It accepts the orderId and the amount as inputs. Then it makes a '/charge' call to the downstream, which returns a 202. Immediately after this call, it calls a '/verify' endpoint to make sure that the previous charge was successful.
Now it may happen that the charge was declined. One of the reasons for this can be that the user used an expired card. What should be the error code in this scenario?
As I see it, I can't send a 4xx, as the request was correct from my API's perspective. A bad request is something that the user can correct. In this case they can't correct anything, since the API just accepts the 'orderId' and the total amount to charge.
If I send a 5XX, then 500 does not make sense, as this was not an 'unexpected condition' on my server. Nor can I send a 503, as my server was not overloaded or down for maintenance.
Currently, I am sending back a 503 with an app code that maps to: Payment verification failed.
The response of the server must always be in the context of the domain responsibility of the service.
If the service "accepts" the request and that is all the requester (client) is expected to know, with the domain operation being performed asynchronously behind the scenes, it should return a 202.
If the interaction is synchronous, you must respond with an error, since the request was unsuccessful.
The response code depends on the domain's remit.
For your service: if the API accepts an identifier in the request that led to the failed payment, and it was the responsibility of the client to pass the right identifier, then you must respond with a 400 Bad Request.
If, however, the API call is just a notification from the client asking you to perform some domain actions, and one of those domain actions failed, then there is nothing the client can do about it and you must return a 5XX, since it is a service failure.
As a rule of thumb, 500 is generally used for ungraceful error scenarios. But if you are OK with calling this a server error, then return a 500.
502 is Bad Gateway: your domain service, acting as a proxy for your downstream services, failed to perform a domain action.
Please choose what fits
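One way to write down those choices for the synchronous case, as a Python sketch. The `verify_result` values and the `client_chose_payment_method` flag are hypothetical; they stand in for whatever your '/verify' call and domain rules actually tell you.

```python
def charge_status(verify_result: str, client_chose_payment_method: bool) -> int:
    """Pick a status code for a synchronous charge followed by a verify call.

    verify_result is a hypothetical value: "ok", "declined" or "unreachable".
    """
    if verify_result == "ok":
        return 200
    if verify_result == "declined" and client_chose_payment_method:
        # The client supplied the identifier that led to the failure.
        return 400
    if verify_result == "unreachable":
        # We acted as a gateway and got no usable answer from downstream.
        return 502
    # A domain action failed and the client cannot do anything about it.
    return 500
```

If the charge were handled asynchronously instead, the endpoint would return a 202 up front and none of these branches would apply to the initial response.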

Orion Federation Security

Let's suppose a scenario where we have two Orions deployed in different cloud infrastructures. Each of these Orions is behind a PEP (Wilma), with no possibility of accessing them without authentication.
Would it be possible to federate these Orions through the PEP?
If Orion2 (O2) is the context provider of Orion1 (O1), when an O1 user sends a queryContext request to O1, will the access token propagate to O2?
Yes, the access token in x-auth-header is propagated in forwarded requests so the scenario you propose (named "pull" federation) should work.
EDIT: if the CB receiving the propagated request uses a PEP governed by the same security framework as the sending CB (in other words, the PEPs of both CBs share the same IDM/AC instance), then federation is automatic. If the CBs don't share the same IDM/AC instance, then you need some piece in the middle (a proxy) able to translate the x-auth-header valid in the sending CB to an x-auth-token valid in the receiving CB (the proxy should interact with the IDM governing the receiving CB).

HTTP Status Code for External Dependency Error

What is the correct HTTP status code to return when a server is having issues communicating with an external API?
Say a client sends a valid request to my server A, A then queries server B's API in order to do something. However B's API is currently throwing 500's or is somehow unreachable, what status code should A return to the client? A 5* error doesn't seem correct because server A is functioning as it should, and a 4* error doesn't seem correct because the client is sending a valid request to A.
Did you consider status codes 502 and 504?
502 – The server, while acting as a gateway or proxy, received an invalid response from the upstream server it accessed in attempting to fulfill the request.
504 – The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed to access in attempting to complete the request.
Of course, this would require a broad interpretation of "gateway" (the implementation of interface A requiring a call to interface B), applied to the application layer. But this could be a nice way to say: "I cannot answer, but it's neither my fault nor yours".
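Under that broad reading of "gateway", the choice between 502 and 504 could be sketched like this in Python. The function and its parameters are hypothetical, and treating an unexpected upstream 4xx as a plain 500 in A is an assumption, not something the definitions above prescribe.

```python
from typing import Optional


def gateway_status(upstream_status: Optional[int], timed_out: bool) -> int:
    """Status that A returns given what happened on its call to B.

    upstream_status is None when no HTTP response was received at all.
    """
    if timed_out:
        return 504  # no timely response from the upstream server
    if upstream_status is None or upstream_status >= 500:
        return 502  # B was unreachable or answered with a server error
    if 200 <= upstream_status < 300:
        return 200
    # Assumption: an upstream 4xx means A built a bad request, which is A's bug.
    return 500
```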
Since the API relies on something that is not available, its service is unavailable as well.
I would think that the status code 503: Service Unavailable is the best fit for your situation. From the RFC description:
The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.
Granted, the description implies that this status code should be applied for errors on the server itself (and not to signal a problem with an external dependency). However, it is the best fit among the RFC status codes, and I wouldn't suggest using custom status codes, because clients are unlikely to understand them.
Alternatively, if your API supports a way of communicating errors (e.g. to tell the user that the ID they supplied is incorrect), you may be able to use this mechanism to tell the user that the dependency is unavailable. This might be a little friendlier and might avoid some bug hunting on the user's side, since at least some users won't be familiar with any status codes besides 403, 404 and maybe 500, depending on your audience.
You can refer to this link:
HTTP Status 424 or 500 for error on external dependency
503 Service Unavailable looks perfect for the situation.
It's been a while since this question was asked.
I faced a similar situation today. After exploring a bit, it makes more sense to me to send 424 Failed Dependency.