Context
I'm developing a REST API that, as you might expect, is backed by multiple external cross-network services, APIs, and databases. A transient failure could occur at any point, in which case the operation should be retried. My question is, during that retry operation, how should my API respond to the client?
Suppose a client is POSTing a resource, and my server encounters a transient exception when attempting to write to the database. Using a combination of the Retry Pattern perhaps with the Circuit Breaker Pattern, my server-side code should attempt to retry the operation, following randomized linear/exponential back-off implementations. The client would obviously be left waiting during that time, which is not something we want.
Questions
Where does the client fit into the retry operation?
Should I perhaps provide an isTransient: true indicator in the JSON response and leave the client to retry?
Should I leave retrying to the server and respond with a message and status code indicative that the server is actively retrying the request and then have the client poll for updates? How would you determine the polling interval in that case without overloading the server? Or, should the server respond via a web socket instead so the client need not poll?
What happens if there is an unexpected server crash during the retry operation? Obviously, when the server recovers, it won't "remember" the fact that it was retrying an operation unless that fact was persisted somewhere. I suppose that's a non-critical issue that would just cause further unnecessary complexity if I attempted to solve it.
I'm probably over-thinking the issue, but while there is a lot of documentation about implementing transient exception retry logic, seldom have I come across resources that discuss how to leave the client "pending" during that time.
Note: I realize that similar questions have been asked, but my queries are more specific, for I'm specifically interested in the different options for where the client fits into a given retry operation, how the client should react in those cases, and what happens should a crash occur that interrupts a retry sequence.
Thank you very much.
There are some rules for retry:
Always create an idempotency key so the server can recognize that a request is a retry.
If your operation is complex and you want to wrap a REST call with a retry, you must ensure that duplicate requests cause no side effects (resume from the point of failure and don't re-execute the steps that already succeeded).
Personally, I think the client should not know that you are retrying something, and of course isTransient: true should not be part of the resource.
Warning: before adding a retry policy to something, you must check its side effects. Putting a retry policy everywhere is bad practice.
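As a rough illustration of the idempotency-key rule, here is a minimal Python sketch. The names IdempotentHandler and create_resource are made up for the example, and the in-memory dict only stands in for whatever durable store you would really use; it assumes the client sends the same key on every retry of the same logical request.

```python
import uuid


class IdempotentHandler:
    """Remembers the outcome of each idempotency key so a retry has no new side effects."""

    def __init__(self, create_resource):
        self._create_resource = create_resource  # hypothetical side-effecting operation
        self._results = {}  # assumption: in-memory stand-in for a durable store

    def handle(self, idempotency_key, payload):
        # A key we have already seen means this request is a retry: replay the result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # If this raises, nothing is stored, so a later retry runs the operation again.
        result = self._create_resource(payload)
        self._results[idempotency_key] = result
        return result


created = []


def create_resource(payload):
    created.append(payload)
    return {"id": len(created), **payload}


handler = IdempotentHandler(create_resource)
key = str(uuid.uuid4())  # generated once by the client, reused for every retry
first = handler.handle(key, {"name": "widget"})
second = handler.handle(key, {"name": "widget"})  # duplicate delivery of the same request
```

Here the second call returns the stored result and create_resource runs only once.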
Related
We have a private REST API that is locked down and only ever called by software we control, not the public. Many endpoints take a JSON payload. If deserialising the JSON payload fails (eg. the payload has an int where a Guid is expected), an exception is thrown and the API is returning a 500 Internal Server Error. Technically, it should return a 400 Bad Request in this circumstance.
Without knowing how much effort is required to ensure a 400 is returned in this circumstance, is there benefit in changing the API to return a 400? The calling software and QA are the only entities that see this error, and it only occurs if the software is sending data that doesn't match the expected model which is a critical defect anyway. I see this as extra effort and maintenance for no gain.
Am I missing something here that the distinction between 400 and 500 would significantly help with?
From a REST perspective:
If you want to follow strict REST principles, you should return 4xx, as the problem is with the data being sent and not with the server program.
5xx are reserved for server errors. For example if the server was not able to execute the method due to site outage or software defect. 5xx range status codes SHOULD NOT be utilized for validation or logical error handling.
From a technical perspective:
The reported error does not convey useful information if another programmer or team has to work on the issue tomorrow
If tomorrow you have to log your errors in a central error log, you will pollute it with wrong status codes
As a consequence, if QA decides to run reports/metrics on errors, they will be erroneous
You may be increasing your technical debt, which can impact your productivity in the future.
The least you can do is to log this issue or create a ticket if you use a tool like JIRA.
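To give an idea of how small the change can be, here is a hedged Python sketch of turning a deserialisation failure into a 400 instead of letting the exception surface as a 500. The function name parse_order and the orderId field are invented for the example, and the (status, body) tuple is just a stand-in for whatever your framework returns.

```python
import json
import uuid


def parse_order(body):
    """Return (status, payload); a malformed body is the client's fault, so answer 400."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        return 400, {"error": f"invalid JSON: {exc.msg}"}

    # Hypothetical model: an object with a GUID in the "orderId" field.
    raw_id = data.get("orderId") if isinstance(data, dict) else None
    if not isinstance(raw_id, str):
        return 400, {"error": "orderId must be a GUID string"}
    try:
        order_id = uuid.UUID(raw_id)
    except ValueError:
        return 400, {"error": "orderId is not a valid GUID"}

    return 200, {"orderId": str(order_id)}
```

An int where a GUID is expected, as in the question, now yields a 400 with a message that tells the caller what was wrong.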
Should it matter if a call to a private REST API returns 400 or 500?
A little bit.
The status code is meta data:
The status-code element is a 3-digit integer code describing the result of the server's attempt to understand and satisfy the client's corresponding request. The rest of the response message is to be interpreted in light of the semantics defined for that status code.
Because we have a shared understanding of the status codes, general purpose clients can use that meta data to understand the broad meaning of the response, and take sensible actions.
The primary difference between 4xx and 5xx is the general direction of the problem. 4xx indicates a problem in the request, and by implication with the client.
The 4xx (Client Error) class of status code indicates that the client seems to have erred.
5xx indicates a problem at the server.
The 5xx (Server Error) class of status code indicates that the server is aware that it has erred or is incapable of performing the requested method.
So imagine, if you would, a general purpose reverse proxy acting as a load balancer. How might the proxy take advantage of the ability to discriminate between 4xx and 5xx?
Well... 5xx suggests that the query itself might be fine. So the proxy could try routing the request to another healthy instance in the cluster, to see if a better response is available. It could look at the pattern of 5xx responses from a specific member of the cluster, and judge whether that instance is healthy or unhealthy. It could then evict that unhealthy instance and provision a replacement.
On the other hand, with a 4xx status code, none of those mitigations make any sense - we know instead that the problem is with the client, and that forwarding the request to another instance isn't going to make things any better.
Even if you aren't going to automatically mitigate the server errors, it can still be useful to discriminate between the two error codes, for internal alarms and reporting.
(In the system I maintain, we're using general purpose monitoring that distinguishes 4xx and 5xx responses, with different thresholds to determine if I should be paged. As you might imagine, I'm rather invested in having that system be well tuned.)
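The proxy behaviour described above could be sketched roughly like this in Python. Everything here is hypothetical: backends are plain callables returning (status, body), and a real proxy would also consider whether the request method is safe to send twice.

```python
def forward(request, backends):
    """Try each backend in turn; fail over on 5xx, never on 4xx."""
    last_response = (503, "no backend available")
    for backend in backends:
        status, body = backend(request)
        if status < 500:
            # Success, redirect, or client error: another instance would not do better.
            return status, body
        last_response = (status, body)  # server-side problem, try the next instance
    return last_response


calls = []


def broken(request):
    calls.append("broken")
    return 500, "boom"


def healthy(request):
    calls.append("healthy")
    return (400, "bad request") if request == "malformed" else (200, "ok")


ok_response = forward("good", [broken, healthy])        # 500 triggers failover
bad_response = forward("malformed", [healthy, broken])  # 400 is returned immediately
```

If the deserialisation failure were reported as a 500, the second call would needlessly be replayed against the other instance as well.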
I have an issue with implementing an idempotent operation in PUT.
There is a PUT request which updates a field in a REST API resource.
To be idempotent, every repeated request should result in the same state of the object.
We use a database, so what happens if an error occurs? Does this mean idempotency is lost?
If not, then going by the same definition: suppose we have a conditional status change on a field in a REST API, e.g. a status field.
If the logic is to update the status field only if the parent property field locked==false, we can otherwise throw an exception saying 'BusinessLogic exception cannot update status'.
So in theory we have two operations in a similar situation:
One could be idempotent if not for real-life errors, which cannot be avoided.
One should not be idempotent, but we can make it behave similarly.
Question:
How do you implement error-handling-based idempotency for PUT? And if error handling is OK, does this mean even business logic could be made an idempotent PUT?
It may help to review the relevant definition of idempotent.
We use a database, so what happens if an error occurs? Does this mean idempotency is lost?
Idempotency is not lost. Idempotency doesn't promise that every request will succeed; it only means that any loss of property that occurs because the server received multiple copies of the request is the fault of the server.
Does this mean even business logic could be made an idempotent PUT?
Yes. You can do this in one of two ways: by designing your domain application protocol so that the requests are inherently idempotent; or by using conditional requests to describe the "before" state that the request is intending to change.
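A minimal sketch of the conditional-request approach, in Python. The StatusResource class is invented for the example and uses a simple version counter as the ETag; the client passes the tag it last saw as If-Match, and a stale tag gets 412 Precondition Failed.

```python
class StatusResource:
    """Hypothetical resource whose PUT only applies to the state the client last saw."""

    def __init__(self, status):
        self.status = status
        self.version = 1

    def etag(self):
        return f'"{self.version}"'

    def put(self, new_status, if_match):
        # The request describes its "before" state via If-Match.
        if if_match != self.etag():
            return 412  # Precondition Failed: the state already changed
        self.status = new_status
        self.version += 1
        return 204


resource = StatusResource("open")
tag = resource.etag()
first = resource.put("closed", tag)   # applied
replay = resource.put("closed", tag)  # duplicate of the same request, rejected
```

The duplicate gets a different status code, but the resource ends up in the same state as if the request had arrived once, which is what idempotency asks for.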
One connection sends many requests to the server.
How can the requests be processed concurrently?
Please use a simple example like the timeserver or echoserver in netty.io to illustrate the operation.
One way I could find is to create a separate threaded handler that is called in a producer/consumer fashion.
The producer will be your "network" handler, handing messages to the consumers without waiting for any answer, and therefore able to proceed with the next request.
The consumer will be your "business" handler, one per connection but possibly multi-threaded, consuming the messages with multiple instances and able to answer using the Netty context of the connection to which it is attached.
Another option for the consumer would be to have only one handler, still multi-threaded, but then each message comes in with its original Netty context so that the handler can answer the client, whichever connection it is attached to.
But the difficulties will come soon:
How to match an answer to one of several requests on the client side: let's say the client sends 3 requests A, B and C, and the answers come back, due to the speed of the business handler, as C, A, B... You have to deal with that and know which request each answer belongs to.
You have to always ensure that the context passed as a parameter is still valid (channel active) if you don't want to have too many errors.
Perhaps the best way would nevertheless be to handle your requests in order (as Netty does), and keep the answering action as quick as possible.
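For the out-of-order problem, one common approach is to tag every request with an id and have the server echo it back. This is not Netty code, just a language-neutral idea sketched in Python with a made-up Correlator class; it is single-threaded and would need locking if used from several threads.

```python
class Correlator:
    """Client-side bookkeeping that matches answers to requests by id."""

    def __init__(self):
        self._callbacks = {}
        self._next_id = 0

    def send(self, on_answer):
        # Allocate an id to put on the wire with the request.
        self._next_id += 1
        self._callbacks[self._next_id] = on_answer
        return self._next_id

    def receive(self, request_id, answer):
        # The server is assumed to echo the request id with each answer.
        callback = self._callbacks.pop(request_id, None)
        if callback is None:
            return False  # unknown or already answered request
        callback(answer)
        return True


results = {}
correlator = Correlator()
id_a = correlator.send(lambda answer: results.update(A=answer))
id_b = correlator.send(lambda answer: results.update(B=answer))
id_c = correlator.send(lambda answer: results.update(C=answer))

# Answers arrive as C, A, B, yet each one reaches the right caller.
correlator.receive(id_c, "answer for C")
correlator.receive(id_a, "answer for A")
correlator.receive(id_b, "answer for B")
```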
Is it appropriate for a server to return 503 ("Service Unavailable") when the requested operation resulted in a database deadlock?
Here is my reasoning:
Initially I tried avoiding database deadlocks, but I ran across https://stackoverflow.com/a/112256/14731
Next, I tried repeating the request on the server-side, but I ran across Java Servlets: How to repeat an HTTP request?. Technically speaking I can buffer the request entity but scalability will suffer and clients are more likely to see 503 Service Unavailable anyway.
Seeing as:
It's easier to ask clients to repeat the operation.
They need to be able to handle 503 Service Unavailable anyway.
Database deadlocks are rather rare.
I'm leaning towards this solution. What do you think?
UPDATE: I think returning 503 ("Service Unavailable") is still acceptable if you wish it, but I no longer think it is technically required. See https://stackoverflow.com/a/17960047/14731.
I think semantically 409 Conflict is a better alternative - basically if you have a deadlock there's contention for some resource, and so the operation could not be completed.
Now depending on the reason for the deadlock, the request may not succeed if submitted a second time, but that's true for anything.
For a 503, as a client I'd implement some sort of back-off/circuit breaker operation, as the system is rate limited, whereas 409 relates to the specific request.
Just got here with the same question and no clear answer on the issue.
A 503 is acceptable but might not be correctly interpreted.
A 409 is also OK, but it was not suitable in my case (since multiple resources could end up returning this error for the same URL).
In my case I ended up returning a 307 redirect on the same URL.
Clients will automatically "retry" and the second call works because the resource is only raising a deadlock during its initial creation.
Be warned that this might end up in an infinite loop.
I think it's fine so long as the entire transaction is rolled back or if the request is idempotent.
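A rough Python sketch of that idea: roll back the whole transaction on a deadlock and tell the client to try again. DeadlockError, transaction and rollback are placeholders for whatever your database layer provides, and the one-second Retry-After value is an arbitrary choice for the example.

```python
class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock exception."""


def handle(transaction, rollback):
    """Run the transaction; on deadlock, undo everything and answer 503."""
    try:
        body = transaction()
    except DeadlockError:
        rollback()  # nothing partial may remain, or the client's retry is unsafe
        return 503, {"Retry-After": "1"}, None
    return 200, {}, body


rolled_back = []


def deadlocked():
    raise DeadlockError("victim of deadlock")


failure = handle(deadlocked, lambda: rolled_back.append(True))
success = handle(lambda: "created", lambda: rolled_back.append(True))
```

The same shape works if you prefer 409; only the status code and headers change.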
I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fireAndForget. You'll indeed miss all errors which could arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry I always fail to find the official, non driver related documentation, I really should bookmark it).
So with NORMAL you'll get at least connectivity errors, with NONE no exceptions at all. If you want to be informed of exceptions you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronously, as that defeats the purpose. The connection which sent the write operation may already be closed or reused, so the error can't be sent back through that connection. Furthermore, only your actual code knows what to do if the write fails. As MongoDB doesn't offer a remote procedure call to inform you asynchronously of updates, you'll have to wait until the write has finished to a given stage.
So the fastest, but most unreliable, is SAFE, where the write has only happened in memory. JOURNAL gives you the assurance that it was written at least to disk. With FSYNC you'll have those changes persisted in your db on disk. REPLICA means that at least two replicas have written it, and MAJORITY that more than half of your replicas have written it (with three replicas, which should be the default, these two don't differ).
The only chance I see to have something like asynchronous writes is a separate thread that performs all write operations synchronously. You could give this thread the actual update as well as a class that is called in case of a failure, to perform the operations needed to handle this failure. But I don't think that this is good application design.
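To make the separate-thread idea concrete, here is a small Python sketch. It does not use any MongoDB driver: write is a placeholder for a synchronous, acknowledged write, and on_failure is the hypothetical callback that knows how to handle a failed document.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker to finish


class WriterThread:
    """Performs writes synchronously on one background thread and reports failures."""

    def __init__(self, write, on_failure):
        self._write = write            # assumption: blocks until the write is acknowledged
        self._on_failure = on_failure  # called with (document, exception)
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, document):
        # Returns immediately; the caller never waits for the database.
        self._queue.put(document)

    def close(self):
        # Drain everything queued so far, then stop the worker.
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            document = self._queue.get()
            if document is _STOP:
                return
            try:
                self._write(document)
            except Exception as exc:
                self._on_failure(document, exc)


saved, failed = [], []


def write(document):
    if "name" not in document:
        raise ValueError("missing name")
    saved.append(document)


writer = WriterThread(write, lambda doc, exc: failed.append((doc, str(exc))))
writer.submit({"name": "a"})
writer.submit({})  # this write fails, and the callback hears about it
writer.close()
```

Note that the failure is only discovered after submit has returned, so the caller has already moved on by the time on_failure runs.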
Yes, depending on the error, it can fail silently if you don't check the returned error code. It's necessary to wait for error checking. Your only other option would be for your app to occasionally tell the user "oops, remember when I acted like I saved your data a moment ago? Well, not really."