Service which provides an interface to an async service and Idempotency violation - service

Please keep in mind that I have a rudimentary understanding of REST and building services. I am asking this question mostly because I am trying to decouple a service from invoking a CLI (within the same host) by providing a front to run async jobs in a scalable way.
I want to build a service where you can submit an asynchronous job. The service should be able to tell me status of the job and location of the results.
APIs
1) CreateAsyncJob
Input: JobId,JobFile
Output: 200 OK (if job was submitted successfully)
2) GetAsyncJobStatus
Input: JobId
Output: Status(inProgress/DoesntExist/Completed/Errored)
3) GetAsyncJobOutput
Input: JobId
Output: OutputFile
Question
The second API, GetAsyncJobStatus violates the principles of idempotency.
How is idempotency preserved in such APIs where we need to update the progress of a particular job?
Is idempotency a requirement in such situations?

Based on the link here, idempotency is a behaviour an API demonstrates by producing the same result on repeated invocations.
As I understand it, idempotency applies at the level of an individual API method (we are mostly concerned with what happens if a client calls this API repeatedly). Hence the best way to maintain idempotency is to segregate read and write operations into separate APIs. This way we can reason more thoroughly about the idempotent behavior of each API method. Also, while this term is gaining traction with RESTful services, the principles hold true for other API systems as well.
In the use case you have provided, the response to the client's API call would differ depending on the status of the job. Assuming that this API is read-only and does not perform any write operations on the server, invoking it leaves the state on the server unchanged. For example, if there were 10 jobs in the system in varied states, calling this API 100 times for a job id could return a different status each time (based on its progress); however, the calls themselves would not change the number of jobs on the server or their states.
However if this API were to be implemented in a way that would alter the state of the server in some way - then this call is not necessarily idempotent.
So keep two APIs - getJobStatus(String jobId) and updateJobStatus(String jobId). The getJobStatus is idempotent while updateJobStatus is not.
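The split between a read-only status call and a separate write path could be sketched like this. It is a minimal illustration only: the `JobStore` class and its in-memory dictionary are assumptions, not part of the question, and stand in for whatever storage the real service would use.

```python
from enum import Enum


class Status(Enum):
    DOES_NOT_EXIST = "DoesntExist"
    IN_PROGRESS = "inProgress"
    COMPLETED = "Completed"
    ERRORED = "Errored"


class JobStore:
    """Hypothetical in-memory stand-in for the service's job storage."""

    def __init__(self):
        self._jobs = {}

    def create_job(self, job_id, job_file):
        # Resubmitting the same client-supplied job_id does not create a second job.
        if job_id not in self._jobs:
            self._jobs[job_id] = {"file": job_file, "status": Status.IN_PROGRESS}
        return job_id

    def get_job_status(self, job_id):
        # Read-only: the answer may change over time, but calling this
        # never changes anything on the server.
        job = self._jobs.get(job_id)
        return job["status"] if job else Status.DOES_NOT_EXIST

    def update_job_status(self, job_id, status):
        # Write path, kept separate from the read; in this sketch it is
        # called by the job runner, not by polling clients.
        self._jobs[job_id]["status"] = status
```

Polling `get_job_status` any number of times leaves `_jobs` untouched; only `create_job` and `update_job_status` write to it.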
Hope this helps

Related

Rest API vs GraphQL - Idempotent

This question/answer from another user was very informative as to what idempotent means:
What is an idempotent operation?
When it comes to a REST API, caching for GET requests can quickly be enabled if it isn't already. If a user wants to fetch, for example, /users/:id or /posts/:id, they can do so as many times as they'd like and it shouldn't mutate any data.
If I'm understanding correctly, a GET request is idempotent in this case.
QUESTION
I believe Relay and Dataloader can help with caching GraphQL queries, but they don't address browser/mobile caching.
If we're talking about just the GET request portion of GraphQL, which goes through a single endpoint, what tech or features could I use to get the caching benefits that regular HTTP requests provide?
Idempotence vs Caching
First of all, caching and idempotence are different things and do not necessarily relate to one another. Caching may or may not be used to implement idempotent operations - it is certainly not a requirement.
Secondly, when speaking of HTTP requests, idempotence essentially concerns the state of the server rather than its responses. An idempotent operation will leave the server in the exact same state if performed multiple times. It does not mean that the responses returned by an idempotent operation will be the same (although they often might be).
GET requests in REST are expected to be idempotent by contract - i.e. a GET operation must not have any state-altering side-effects on the server (technically this may not always be true, but for the sake of the explanation let's assume it is). This does not mean that if you GET the same resource multiple times, you'll always get the same data back. Actually, that wouldn't make sense since resources change over time and may be deleted as well right? So caching GET responses can help with performance but has nothing to do with idempotence.
Another way to look at it is that you are querying for a resource rather than arbitrary data. A GET request is idempotent in that you'll always get the same resource and you won't modify the state of the system in any way.
Finally, a poorly (oddly?) developed GET operation on the server side might have side-effects, which would violate the REST contract and make the operation non idempotent. Simple response caching would not help in such a case.
GraphQL
I like to see GraphQL queries as equivalent to GETs in REST. This means that if you query for data in GraphQL, the resolvers must not perform any side-effects. This would ensure that performing the same query multiple times will leave the server in an unchanged state.
Just like in a simple GET you'd be querying for specific resources, although unlike in GET, GraphQL allows you to query for many instances of many different types of resources at once. Once again, that does not mean identical responses, first and foremost because the resources may change over time.
If some of your queries have side-effects (i.e. they alter the state of the resources on the server), they are not idempotent! You should probably use mutations instead of queries to achieve these side-effects. Using mutations would make it clear to the client/consumer that the operation is not idempotent and should be treated accordingly (mutation inputs may accept idempotence keys to ensure Stripe-like idempotency, but that's a separate topic).
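As a rough illustration of the idempotence-key idea mentioned in passing above, here is a minimal sketch of a mutation handler that deduplicates retries. The `PaymentService` class, its method name and the in-memory storage are all hypothetical; this is not Stripe's API or any GraphQL library's.

```python
class PaymentService:
    """Hypothetical mutation handler that deduplicates by idempotence key."""

    def __init__(self):
        self.charges = []   # server state changed by the mutation
        self._seen = {}     # idempotence key -> response already returned

    def create_charge(self, idempotency_key, amount):
        # A retry carrying the same key gets the stored response back
        # and causes no further side effect.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        charge = {"id": len(self.charges) + 1, "amount": amount}
        self.charges.append(charge)
        self._seen[idempotency_key] = charge
        return charge
```

Calling `create_charge("k1", 100)` twice leaves exactly one charge on the server, while a call with a new key creates another.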
Caching GraphQL responses
I hope that by now it's clear that caching is not required to ensure / is not what determines the idempotency of GraphQL queries. It's only used to improve performance.
If you are still interested in server-side caching options for GraphQL, there are plenty of resources. You could start by reading what Apollo Server documentation has to say on the topic. Don't forget that you can also cache database/service/etc. responses. I am not going to provide any specific suggestions since, judging by your question, there's much greater confusion elsewhere.

HTTP GET for 'background' job creation and acquiring

I'm designing an API for a job scheduler. There is one scheduler with some set of resources and DB tables for them. There are also multiple 'workers' that request 'jobs' from the scheduler. A worker can't create a job; it can only request one. The job must be calculated on the server side. A job is also a dynamic entity, calculated using multiple DB tables and time. There is no 'job' table.
In general this system is very similar to a task queue, but without the queue. I need a method for a worker to request the next task. That task should be calculated and assigned to this agent.
Is it OK to use GET verb to retrieve and 'lock' job for the specific worker?
In terms of resources this query does not modify anything. Only internal DB state is updated. To the client it looks like fetching records one by one. It doesn't know about the internal modifications.
In pure REST style I probably should define a job table and a CRUD API for it. Then I would need to create some auxiliary service to POST jobs to that table. Each agent would then list jobs using GET and lock one using PATCH. That approach requires multiple potential retries due to race conditions (a job may already be locked by another agent). It also looks a bit complicated if I need to assign a job to a specific agent based on server-side logic. In that case I need to implement some check logic on the client side to iterate through jobs based on different responses.
This approach looks complicated.
Is it OK to use GET verb to retrieve and 'lock' job for the specific worker?
Maybe? But probably not.
The important thing to understand about GET is that it is safe:
The purpose of distinguishing between safe and unsafe methods is to
allow automated retrieval processes (spiders) and cache performance
optimization (pre-fetching) to work without fear of causing harm. In
addition, it allows a user agent to apply appropriate constraints on
the automated use of unsafe methods when processing potentially
untrusted content.
If aggressive cache performance optimization would make a mess in your system, then GET is not the http method you want triggering that behavior.
If you were designing your client interactions around resources, then you would probably have something like a list of jobs assigned to a worker. Reading the current representation of that resource doesn't require that a server change it, so GET is completely appropriate. And of course the server could update that resource for its own reasons at any time.
Requests to modify that resource should not be safe. For instance, if the client is going to signal that some job was completed, that should be done via an unsafe method (POST/PUT/PATCH/DELETE/...)
I don't have such a resource. It's an ephemeral resource which is spread across the tables. There is no DB table for it and there is no ID column to update that job. Why I don't have such a table is another question, but it's a current requirement and limitation.
Fair enough, though the main lesson still stands.
Another way of thinking about it is to think about failure. The network is unreliable. In a distributed environment, the client cannot distinguish a lost request from a lost response. All it knows is that it didn't receive an acknowledgement for the request.
When you use GET, you are implicitly telling the client that it is safe (there's that word again) to resend the request. Not only that, but you are also implicitly telling any intermediate components that it is safe to repeat the request.
If there are no adverse effects to handling multiple copies of the same request, the GET is fine. But if processing multiple copies of the same request is expensive, then you should probably be using POST instead.
It's not required that the GET handler be safe -- the standard only describes the semantics of the messages; it doesn't constrain the implementation at all. But any loss of property incurred is properly understood to be the responsibility of the server.
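One way to make "request and lock the next job" tolerate resent requests is to expose it through an unsafe method and have the client send a request id. The sketch below is only an illustration under stated assumptions: the `Scheduler` class, the `request_id` parameter and the endpoint named in the docstring are hypothetical, and the `pending` list stands in for whatever server-side calculation produces the next job.

```python
import threading


class Scheduler:
    """Hypothetical server-side logic behind a 'claim next job' endpoint."""

    def __init__(self, pending):
        self._pending = list(pending)   # stand-in for the computed jobs
        self._claims = {}               # request_id -> (worker_id, job)
        self._lock = threading.Lock()   # one claim is processed at a time

    def claim_next(self, worker_id, request_id):
        """Handler for a hypothetical POST /workers/{worker_id}/claims."""
        with self._lock:
            # A resent request (lost response, client retry) gets the same
            # job back instead of locking a second one.
            if request_id in self._claims:
                return self._claims[request_id][1]
            if not self._pending:
                return None
            job = self._pending.pop(0)
            self._claims[request_id] = (worker_id, job)
            return job
```

Because the state change sits behind POST, crawlers, pre-fetchers and caches will not trigger it, and the request id keeps a worker's own retry from consuming an extra job.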

How to merge/consolidate responses from multiple RESTful microservices?

Let's say there are two (or more) RESTful microservices serving JSON. Service (A) stores user information (name, login, password, etc) and service (B) stores messages to/from that user (e.g. sender_id, subject, body, rcpt_ids).
Service (A) on /profile/{user_id} may respond with:
{id: 1, name:'Bob'}
{id: 2, name:'Alice'}
{id: 3, name:'Sue'}
and so on
Service (B) responding at /user/{user_id}/messages returns a list of messages destined for that {user_id} like so:
{id: 1, subj:'Hey', body:'Lorem ipsum', sender_id: 2, rcpt_ids: [1,3]},
{id: 2, subj:'Test', body:'blah blah', sender_id: 3, rcpt_ids: [1]}
How does the client application consuming these services handle putting the message listing together such that names are shown instead of sender/rcpt ids?
Method 1: Pull the list of messages, then start pulling profile info for each id listed in sender_id and rcpt_ids? That may require 100's of requests and could take a while. Rather naive and inefficient and may not scale with complex apps???
Method 2: Pull the list of messages, extract all user ids and make a separate bulk request for all relevant users... this assumes such a service endpoint exists. There is still a delay between getting the message listing, extracting user ids, sending the request for bulk user info, and then waiting for the bulk user info response.
Ideally I want to serve out a complete response set in one go (messages and user info). My research brings me to merging of responses at service layer... a.k.a. Method 3: API Gateway technique.
But how does one even implement this?
I can obtain list of messages, extract user ids, make a call behind the scenes and obtain users data, merge result sets, then serve this final result up... This works ok with 2 services behind the scenes... But what if the message listing depends on more services... What if I needed to query multiple services behind the scenes, further parse responses of these, query more services based on secondary (tertiary?) results, and then finally merge... where does this madness stop? How does this affect response times?
And I've now effectively created another "client" that combines all microservice responses into one mega-response... which is no different than Method 1 above... except at server level.
Is that how it's done in the "real world"? Any insights? Are there any open source projects that are built on such API Gateway architecture I could examine?
The solution which we used for such problem was denormalization of data and events for updating.
Basically, a microservice holds, ahead of time, the subset of data it requires from other microservices, so that it doesn't have to call them at run time. This data is managed through events. When other microservices are updated, they fire an event with an id as context, which can be consumed by any microservice that has an interest in it. This way the data remains in sync (of course it requires some form of failure-handling mechanism for events). This seems like a lot of work, but it helps us with any future decisions regarding consolidation of data from different microservices. Our microservice will always have all the data available locally for it to process any request without a synchronous dependency on other services.
In your case, i.e. for showing names with a message, you can keep an extra property for names in Service (B). Whenever a name is updated in Service (A), it fires an update event with the id for the updated name. Service (B) then consumes the event, fetches the relevant data from Service (A) and updates its database. This way, even if Service (A) is down, Service (B) will still function, albeit with some stale data that will eventually become consistent when Service (A) comes back up, and you will always have some name to show in the UI.
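A minimal sketch of that event-driven copy, assuming an in-process callback in place of a real message broker. The `MessageService` class, the `on_user_updated` handler and the `fetch_profile` callable are hypothetical names used only for illustration.

```python
class MessageService:
    """Hypothetical Service (B) keeping a local, denormalized copy of names."""

    def __init__(self, fetch_profile):
        # fetch_profile calls Service (A); it is used only when an event arrives.
        self._fetch_profile = fetch_profile
        self._names = {}      # user_id -> name, the local copy
        self._messages = []

    def on_user_updated(self, user_id):
        # The event carries only the id; fetch fresh data and update the copy.
        self._names[user_id] = self._fetch_profile(user_id)["name"]

    def add_message(self, message):
        self._messages.append(message)

    def messages_for(self, user_id):
        # Served entirely from local data: no call to Service (A) here.
        return [
            {**m, "sender_name": self._names.get(m["sender_id"], "unknown")}
            for m in self._messages
            if user_id in m["rcpt_ids"]
        ]
```

Between a rename in Service (A) and the arrival of its event, `messages_for` keeps returning the old name, which is the stale-but-available behaviour described above.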
https://enterprisecraftsmanship.com/2017/07/05/how-to-request-information-from-multiple-microservices/
You might want to perform response aggregation strategies on your API gateway. I've written an article on how to perform this on ASP.net Core and Ocelot, but there should be a counter-part for other API gateway technologies:
https://www.pogsdotnet.com/2018/09/api-gateway-response-aggregation-with.html
You need to write another service, called an Aggregator, which will internally call both services, merge/filter their responses and return the desired result. This can easily be done in a non-blocking way using Mono/Flux in Spring Reactive.
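The same aggregation idea can be sketched with Python's asyncio instead of Mono/Flux. The two `fetch_*` coroutines below are stand-ins for HTTP calls to the services from the question, and the sample data and function names are assumptions made for illustration.

```python
import asyncio

USERS = {1: "Bob", 2: "Alice", 3: "Sue"}
MESSAGES = [
    {"id": 1, "subj": "Hey", "sender_id": 2, "rcpt_ids": [1, 3]},
    {"id": 2, "subj": "Test", "sender_id": 3, "rcpt_ids": [1]},
]


async def fetch_messages(user_id):
    # Stand-in for GET /user/{user_id}/messages on Service (B).
    await asyncio.sleep(0)
    return [m for m in MESSAGES if user_id in m["rcpt_ids"]]


async def fetch_profile(user_id):
    # Stand-in for GET /profile/{user_id} on Service (A).
    await asyncio.sleep(0)
    return {"id": user_id, "name": USERS[user_id]}


async def messages_with_names(user_id):
    messages = await fetch_messages(user_id)
    # Collect each distinct user id once, then fetch the profiles concurrently.
    ids = {m["sender_id"] for m in messages}
    for m in messages:
        ids.update(m["rcpt_ids"])
    profiles = await asyncio.gather(*(fetch_profile(i) for i in sorted(ids)))
    names = {p["id"]: p["name"] for p in profiles}
    return [
        {
            "id": m["id"],
            "subj": m["subj"],
            "sender": names[m["sender_id"]],
            "rcpts": [names[i] for i in m["rcpt_ids"]],
        }
        for m in messages
    ]
```

Running `asyncio.run(messages_with_names(1))` returns the two messages with names in place of ids. The client makes one request, and the aggregator pays one round trip for messages plus one concurrent round of profile lookups.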
An API Gateway often does API composition.
But this is a typical engineering problem when you have microservices that implement the database-per-service pattern.
The API Composition and Command Query Responsibility Segregation (CQRS) patterns are useful ways to implement queries.
Ideally I want to serve out a complete response set in one go
(messages and user info).
The problem you've described is one Facebook ran into years ago, and they decided to tackle it by creating an open source specification called GraphQL.
But how does one even implement this?
It is already implemented in various popular programming languages and maybe you can give it a try in the programming language of your choice.

Batch processing via REST service

Are there any best practices for performing BATCH operations via REST for POST, PUT, PATCH verbs?
The current paradigm I am following is that the JSON payload is specified in the body for all 3 operations:
a) POST to return the location of the created resource
b) PUT / PATCH return a 201 if the update is successful
For a batch operation, I intend to accept a collection of JSON objects in the payload body but am trying to figure what to return to the client.
While processing the batch, the operation may succeed for some of the items but might fail for others.
Taking this into account, my take is that the best thing to do is to return a collection of objects indicating the Success/Failure status of each item from the payload.
But this deviates from my paradigm outlined in (a) and (b) above.
Instead, does it make sense to return an identifier representing an ID of the Batch operation itself to the client?
The client would then issue a subsequent GET to get the result of the operation it requested.
Does this approach sound reasonable? If so, does it make sense to block the client on the subsequent GET if the operation hasn't completed OR does it make sense to always return the most current state i.e. a collection of responses for each of the items that the client requested to process.
Ideas/Thoughts/Suggestions?
Since REST is an architectural style with no necessarily clear "guidelines" and no mandate on how the actions for HTTP verbs should be implemented, clearly there is no right or wrong answer here.
I am looking for a solution that is elegant, natural and intuitive.
REST operations are supposed to be atomic as seen from the outside. That is, if one part of the request fails, then the whole state of the server should revert to the pre-request state and a 4xx or 5xx response returned (so, for example, the request could be repeated in whole without ill effects if it failed the first time). This however has nothing to do with batch operations per se—such a request could be any kind of request.
Batch operations violate a different REST constraint, that of the uniform interface (defined by HTTP's methods and their operation upon a resource at the specified URL).
If you want to do batch operations, give up on trying to call your API RESTful, because you have already lost out on the benefits that REST imparts, and are just lying to yourself.
If you want to retain those benefits, give up on batch operations.
REST is an architectural style.
I implement APIs to always return the result of what was updated. So in the case of a POST it will return the created entity, with PATCH and PUT it will return the updated entity.
Depending on the batch size I would either return an array of what was processed, or alternatively an array of identifiers of what was processed.
If the batch operation is long running return an identifier for the batch, but make it painfully clear that the endpoint is different from other endpoints
ex. if you create a user by posting to
http://somesite.com/users
send batch requests to
http://somesite.com/batch/users
On a GET, return the status of the batch operation while it is still running; on completion, return an array of the records that were updated.
The most important thing is consistency, whatever you choose, always follow the same approach with batch operation throughout the system.
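A minimal sketch of the batch-identifier approach with per-item results, assuming in-memory storage and synchronous processing for brevity. The `BatchService` class, the validation rule and the shape of each result entry are hypothetical choices, not an established convention.

```python
import uuid


class BatchService:
    """Hypothetical handlers behind /batch/users."""

    def __init__(self):
        self.users = {}
        self._batches = {}

    def submit_user_batch(self, items):
        # POST /batch/users: returns an id the client can poll with GET.
        batch_id = str(uuid.uuid4())
        self._batches[batch_id] = {"status": "running", "results": []}
        self._run(batch_id, items)  # a real service would do this in the background
        return batch_id

    def _run(self, batch_id, items):
        results = []
        for index, item in enumerate(items):
            # Each item succeeds or fails on its own and is reported by position.
            if "name" not in item:
                results.append({"index": index, "ok": False, "error": "name is required"})
                continue
            user_id = len(self.users) + 1
            self.users[user_id] = item
            results.append({"index": index, "ok": True, "id": user_id})
        self._batches[batch_id] = {"status": "complete", "results": results}

    def get_batch(self, batch_id):
        # GET /batch/users/{batch_id}: current status plus results so far.
        return self._batches[batch_id]
```

Reporting each item by its index in the submitted array lets the client see exactly which entries failed without blocking on the GET.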

is there any way to determine the order of response in the batch response?

I have a question regarding batch responses.
If I query the home of 20 different users (A, B, C, D, etc.) in one batch request, is there any guarantee that the batch responses will be in the same order as the batch requests (A, B, C, D, etc.)?
Because if the response order is different from the request order of users, then there is a BIG problem for my application.
Or is there any other way that I can know which response belongs to which user?
Is it safe enough to specify dependencies between operations in the request (for each user)? Any better solutions?
Regards, Grace
Actually I think order is guaranteed, because otherwise how would you even match up your results to your requests? See this comment:
Graph Batch API
The docs just state the requests may be executed in an arbitrary order on the server. Your responses will always be in the same order, guaranteed. The whole batch concept makes no sense if not. The execution order on the server matters when the batched requests are dependent on each other in any sense. And there's a graph semantic for expressing that too. – Zahan M Sep 2 '11 at 5:01
This is probably a question that could be better answered in the Facebook developers forum, but reading the documentation, it looks like the order is not guaranteed:
By default, the operations specified in the batch API request are independent - they can be executed in arbitrary order on the server and an error in one operation does not affect execution of other operations.
You could force an execution order by using dependencies (as explained in the same page), but I would rather rethink how you handle the responses instead.
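If the responses do come back in request order, as the quoted comment says, matching them up is a positional pairing. The sketch below assumes exactly that, and it fails loudly when the lengths differ rather than guessing. The function name and the field names in the usage note are illustrative, not taken from the documentation quoted above.

```python
def pair_by_position(requests, responses):
    """Pair each batch request with the response at the same index.

    Assumes the server returns responses in request order.
    """
    if len(requests) != len(responses):
        raise ValueError("batch response length does not match request length")
    return {req["relative_url"]: resp for req, resp in zip(requests, responses)}
```

For example, pairing requests for `A/home` and `B/home` with a two-element response list gives a dictionary keyed by those paths, so later code looks results up by user instead of relying on list position.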