Is it valid to modify a REST API representation based on a If-Modified-Since header? - rest

I want to implement a "get changed values" capability in my API. For example, say I have the following REST API call:
GET /ws/school/7/student
This gets all the students in school #7. Unfortunately, this may be a lot. So, I want to modify the API to return only the student records that have been modified since a certain time. (The use case is that a nightly process runs from another system to pull all the students from my system to theirs.)
I see http://blog.mugunthkumar.com/articles/restful-api-server-doing-it-the-right-way-part-2/ recommends using the if-modified-since header and returning a representation as follows:
Search all the students updated since the time requested in the if-modified-since header
If there are any, return those students with a 200 OK
If there are no students returned from that query, return a 304 Not Modified
I understand what he wants to do, but this seems the wrong way to go about it. The definition of the If-Modified-Since header (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.24) says:
The If-Modified-Since request-header field is used with a method to make it conditional: if the requested variant has not been modified since the time specified in this field, an entity will not be returned from the server; instead, a 304 (not modified) response will be returned without any message-body.
This seems wrong to me. We would not be returning the representation or a 304 as indicated by the RFC, but some hybrid. It seems like client side code (or worse, a web cache between server and client) might misinterpret the meaning and replace the local cached value, when it should really just be updating it.
So, two questions:
Is this a correct use of the header?
If not (and I suspect not), what is the best practice? Query string parameter?

This is not the correct use of the header. The If-Modified-Since header is one which an HTTP client (browser or code) may optionally supply to the server when requesting a resource. If supplied the meaning is "I want resource X, but only if it's changed since time T." Its purpose is to allow client-side caching of resources.
The semantics of your proposed usage are "I want updates for collection X that happened since time T." It's a request for a subset of X. It does not seem like your motivation is to enable caching. Your client-side cached representation seemingly contains all of X, even though the typical request will only return you a small set of changes to X; that is, the response is not what you are directly caching, so the caching needs to happen in custom user logic client-side.
A query string parameter is a much more appropriate solution. Below {seq} would be something like a sequence number or timestamp.
GET /ws/schools/7/students/updates?since={seq}
Server-side I imagine you have a sequence of updates since the beginning of your system and a request of the above form would grab the first N updates that had a sequence value greater than {seq}. In this way, if a client ever got very far behind and needed to catch up, the results would be paged.

Related

RESTful - How to update a subresource and what ETag/payload to return concerning optimistic locking?

As an example I have an order where the invoicing address can be modified. The change might trigger various additional actions (i.e. create a cancellation invoice and a new invoice with the updated address).
As recommended by various sources (see below) I don't want to have a PATCH on the order resource, because it has many other properties, but want to expose a dedicated endpoint, also called "intent" resource or subresource according to the web links below:
/orders/{orderId}/invoicing-address
Should I use a POST or a PATCH against this subresource?
The invoicing address itself has no ID. In the domain layer it is represented as a value object that is part of the order entity.
What ETag should be used for the subresource?
The address is part of the order and together with the items they form an aggregate in the domain layer. When the aggregate is updated it gets a new version number in the database. That version number is used as an ETag for optimistic locking.
Should a GET on invoicing-address respond with the order aggregate version number or a hash value of the address DTO in the ETag header?
What payload should be returned after updating the address?
Since the resource is the invoicing address it seems natural to return the updated address object (maybe with server side added fields). Should the body also include the ID/URI and the ETag of the order resource?
None of the examples I found with subresources showed any server responses or considered optimistic locking.
https://rclayton.silvrback.com/case-against-generic-use-of-patch-and-put
https://www.thoughtworks.com/insights/blog/rest-api-design-resource-modeling
https://softwareengineering.stackexchange.com/questions/371273/design-update-properties-on-an-entity-in-a-restful-resource-based-api (see provided answer)
https://www.youtube.com/watch?v=aQVSzMV8DWc&t=188s (Jim Webber at about about 31 mins)
As far as REST is concerned, "subresources" aren't a thing. /orders/12345/invoicing-address identifies a resource. The fact that this resource has a relationship with another resource identified by /orders/12345 is irrelevant.
Thus, the invoicing-address resource should understand HTTP methods exactly the same way as every other resource on the web.
Should I use a POST or a PATCH against this subresource?
Use PUT/PATCH if you are proposing a direct change to the representation of the resource. For example, these are the HTTP methods we would use if we were trying to fix a spelling error in an HTML document (PUT if we were sending a complete copy of the HTML document; PATCH if we were sending a diff).
PUT /orders/12345/invoicing-address
Content-Type: text/plain
1060 W Addison St.
Chicago, IL
60613
On the other hand, if you are proposing an indirect change to the representation of the resource (the request shows some information to the server, and the server is expected to compute a new representation itself)... well, we don't have a standardized method that means exactly that; therefore, we use POST
POST serves many useful purposes in HTTP, including the general purpose of “this action isn’t worth standardizing.” -- Fielding, 2009
What ETag should be used for the subresource?
You should first give some thought to whether you want to use a strong-validator or a weak validator
A strong validator is representation metadata that changes value whenever a change occurs to the representation data that would be observable in the content of a 200 (OK) response to GET.
...
In contrast, a weak validator is representation metadata that might
not change for every change to the representation data.
...
a weak entity-tag ought to change whenever the origin server wants caches to invalidate old responses.
I might use a weak validator if the representation included volatile but insignificant information; I don't need clients to refresh their copy of a document because it doesn't have the latest timestamp metadata. But I probably wouldn't use an "aggregate version number" if I expected the aggregate to be changing more frequently than the invoicing-address itself changes.
What payload should be returned after updating the address?
See 200 OK.
In the case of a POST request, sending the current representation of the resource (after changes have been made to it) is nice because the response is cacheable (assuming you include the appropriate metadata signals in the response headers).
Responses to PATCH have similar rules to POST (see RFC 5789).
PUT is the odd man out, here
Responses to the PUT method are not cacheable.
Should the body also include the ID/URI and the ETag of the order resource?
Entirely up to you - HTTP components aren't going to be paying attention to the representation, so you can design that representation as makes sense to you. On the web, it's perfectly normal to return HTML documents with links to other HTML documents.

Response body when PATCHing a collection

In my REST API I have a very large collection; it contains millions of items. The path for this collection is /mycollection
Because this collection is so large it is not good practice to GET the whole collection so the API supports paging. Paging will be the primary way of getting the collection
GET /mycollection?page=1&page-size=100 HTTP/1.1
Say the original collection contains 1,000,000 items and I want to update 5,000, delete 3,000 and add 2,000 items. I could write my API to support updating the collection via either the PUT method or the PATCH method. While either method would require very different request bodies I believe both methods would require the exact same response body, i.e. the response body would have to contain the current representation of the entire updated resource, i.e. all 999,000 items in the collection.
As I mentioned earlier GETting the entire collection is just not realistic; it's too big. For the same reason I don't want PUTting or PATCHing to return the entire collection. Adding query parameters to a PUT or PATCH request wouldn't work either because neither PUT nor PATCH are safe methods.
So what would be the proper response in this large collection scenario?
I could respond with
HTTP/1.1 202 Accepted
Location: /mycollection?page=1&page-size=100
The 202 Accepted response code doesn't feel correct because the update would have been done synchronously. The Location header doesn't quite feel right either. I could maybe go with a Links header, but still it doesn't feel right.
Again I ask what would be the proper response in this large collection scenario?
This question is based on a misconception:
While either method would require very different request bodies I believe both methods would require the exact same response body, i.e. the response body would have to contain the current representation of the entire updated resource
Either can just return 204 No Content or 200 OK and no response body. There's no requirement that they include the full representation in the response.
You could optionally support this (perhaps along with the Prefer: return=representation header, or perhaps Content-Location header), but without this header I would say it's not even a convention that the current representation is returned. Generic clients shouldn't assume that the response body is the new representation unless these headers are used.
So, just return a 2xx and you're good to go.
So what would be the proper response in this large collection scenario?
Short version: you should probably treat a successful PUT as though it were a successful POST.
the intended meaning of the payload can be summarized as:
a representation of the status of, or results obtained from, the action
So the response could be as simple as
200 OK
Content-Type: text/plain
It worked!
Longer answer:
While either method would require very different request bodies I believe both methods would require the exact same response body, i.e. the response body would have to contain the current representation of the entire updated resource
This isn't right - If you review RFC 7231, you'll see that the response to PUT has this description
a representation of the status of the action
Returning the new representation of the resource is an edge case, not the default (see the specification of the Content-Location header).
For a state-changing request like PUT (Section 4.3.4) or POST (Section 4.3.3), it implies that the server's response contains the new representation of that resource, thereby distinguishing it from representations that might only report about the action (e.g., "It worked!"). This allows authoring applications to update their local copies without the need for a subsequent GET request.
That said, I'd suggest a review of your choice of method token. Both PUT and PATCH support remote authoring semantics - messages that ask a server to make its copy of a document look like your local copy. That's why, for example, the PUT specification has a bunch of constraints about adding validator header fields to the response. General purpose components are allowed to assume that they know what's going on, because all resources are supposed to understand these methods the same way.
But in your case, you can't really be said to be remote authoring the collection, because the client (and the general purpose components) don't have a representation of the collection, but instead only representations of pages of the collection.
If you were going to be consistent with the uniform interface, then you would either
allow remote authoring of the pages, or
abandon the method tokens that imply remote authoring
It is okay to use POST when the semantics of your request don't quite align with the standardized meanings
POST serves many useful purposes in HTTP, including the general purpose of “this action isn’t worth standardizing.”

Is using the If-Modified-Since header to filter a resource collection to only recent ones in a REST API considered a valid approach?

I'm designing a REST API where I have a need to provide the option to GET only the resources in a collection that were created or modified recently, based on a client-provided timestamp (which, in turn, will have been generated by the API in a previous response). I'm considering the use of the Last-Modified and If-Modified-Since headers for this purpose.
Earlier questions here (like Is it valid to modify a REST API representation based on a If-Modified-Since header?) seems to indicate that this is frowned upon, on the grounds that RFC2616 indicates that the purpose of these headers is related to caching. However, since then, RFC2616 has been superseded by RFC7232, which states that
If-Modified-Since is typically used for two distinct purposes: 1) to allow efficient updates of a cached representation that does not have an entity-tag and 2) to limit the scope of a web traversal to resources that have recently changed.
My interpretation is that my use case of allowing retrieval of all changes to the collection since the last retrieval is covered by the second purpose.
So I have two questions:
Is this interpretation correct, or am I missing something subtle here?
Even if my interpretation is correct, does that make it a good practice to use these headers in this way? In other words: what other reasons would there be to not use these headers after all and instead, for example, include a timestamp in the response and allow the client to provide that back in the query string for the next request?
Is this interpretation correct, or am I missing something subtle here?
I believe RFC 7234 contradicts your interpretation.
If an If-None-Match header field is not present, a request containing an If-Modified-Since header field (Section 3.3 of [RFC7232]) indicates that the client wants to validate one or more of its own stored responses by modification date. A cache recipient SHOULD generate a 304 (Not Modified) response (using the metadata of the selected stored response) if one of the following cases is true....
The broad problem here is that a general purpose cache isn't going to know that your resource / your server have a different understanding of what the standard headers mean, and therefore clients are not going to have the experience you want.
Furthermore...
I'm designing a REST API where I have a need to provide the option to GET only the resources in a collection that were created or modified recently, based on a client-provided timestamp (which, in turn, will have been generated by the API in a previous response).
We already have a standardized mechanism for this - it's the URI. That may become clearer if you review Fielding's definition of resource.
I understand it this way: "resource", within the context of REST, is a generalization of "document" (see also Jim Webber, 2011). It's perfectly reasonable to have many different documents derived from the same (or overlapping) information.
Think "paging" - every page is a different document, with its own unique identifier, but all of the pages are being derived from the same common source, with items moving from one page to another over time.

Should this GET call return 204 or 200 with a body?

Say we have a rest api which aggregates data about a Person from other services. One of the Aggregator service routes is GET /person/(person id)/driverinfo which tells us whether the person is a licensed driver or not, license id, expiry date of license and the number of traffic violations. These data can be picked up by the Aggregator from one or more other services. This api will be used by a web page to show the "driver info" about a person. It will also be tested with automation.
Currently, the api gives 204 no content response for persons who never had a driving license. This is because one of the underlying apis gives a 204 for that scenario. So, it was decided that the Aggregator should do the same.
But, I believe that this is not a good response. Instead, we should return 200 with appropriate values for the fields. For example, licensed=false, licenseId = N.A. etc. when the underlying api gives a 204. I.e. the Aggregator should generate these fields and their values.
Which approach do you think is better and why ?
204 means something specific in HTTP; it says that the server found a representation of the requested resource, and that representation is zero bytes long.
Therefore, the real question is more like "Should we use a zero byte long message to describe a situation?". Maybe? If all of the fields in your message schema are optional, and we are trying to describe a representation that means that all of the fields are taking on their default values, then a zero byte array might be the right way to communicate that.
Within the context of HTTP specifically, the headers themselves are already significant in length (compared to zero), so I wouldn't expect there to be particularly compelling performance reasons to squeeze a signal down to zero length. For instance, if we were normally passing around application/json, I would expect that sending an empty object or array to be much more cost effective than sending nothing at all.

HTTP Response 412 - can you include content?

I am building a RESTful data store and leveraging Conditional GET and PUT. During a conditional PUT the client can include the Etag from a previous GET on the resource and if the current representation doesn't match the server will return the HTTP status code of 412 (Precondition Failed). Note this is an Atom based server/protocol.
My question is, when I return the 412 status can I also include the new representation of the resource or must the user issue a new GET? The HTTP spec doesn't seem to say yes or no and neither does the Atom spec (although their example shows an empty entity body on the response). It seems pretty wasteful not to return the new representation and make the client specifically GET it. Thoughts?
Although conditional GETs and PUTs are summarized as 'conditional requests' they are very different conceptually. Conditional GETs are a performance optimization and conditional PUTs are a concurrency control mechanism. It is hard to discuss them together.
To your question regarding the conditional GET: If you send GET and include an If-None-Match header the server will send 200 Ok if the resource has changed and 304 Not Modified if it did not (if the condition failed). 412 is only to be used with conditional PUTs.
UPDATE: It seems I misread the question slightly. To your point regarding the 'refresh' of the local copy upon a failed conditional PUT: It might well be that a cache already has the newest version and that your refresh-GET will be served from some cache. Having the server return the current entity with the 412 might actually give you worse performance.
No, technically you should not. Error codes are generally to specify that something has gone wrong. Although nothing would stop you from returning content (and in fact, some errors like a 404 return a pretty page that says You didn't find what you're looking for), the point of the response is not to return other content, but to return something that tells you what was wrong. Technically you also should not return that data because you passed the If-None-Match: etag (I'm assuming that's what you passed?)
On another note, do you really need to optimize away one additional http call?
The more I think about this, the more I'm convinced it's a bad idea - Are you going to return the content on any other errors? PUT semantics are to PUT. GET semantics should be used for GET.
If the number of additional requests incurred, due to an extra request after an update conflict, is significant enough for you to have performance concerns, then I would suggest you might have issues with the granularity of your resources.
Do you really expect millions of times a day multiple users will be editing the same resource simultaneously? Maybe you need to be storing delta changes to the resource instead updating the resource directly. If there really is that much contention for these resources then aren't users going to be constantly working on out of date data.
If your problem was that your resource contains the last-modified date and last-modified user and you had to do a GET after every PUT then I would be more convinced of the need to twist the rules.
However, I think the performance hit of the extra request is worth it for the clarity to the client developer.